Histopathology is insufficient to predict disease progression and clinical outcome in lung
adenocarcinoma. Here we show that gene-expression profiles based on microarray analysis can be
used to predict patient survival in early-stage lung adenocarcinomas. Genes most related to
survival were identified with univariate Cox analysis. Using either two equivalent but independent
training and testing sets, or 'leave-one-out' cross-validation analysis with all tumors, a risk index
based on the top 50 genes identified low-risk and high-risk stage I lung adenocarcinomas, which
differed significantly with respect to survival. This risk index was then validated in an
independent sample of lung adenocarcinomas, in which it also identified high- and low-risk groups.
This index included genes not previously associated with survival. The identification of a set of
genes that predict survival in early-stage lung adenocarcinoma allows delineation of a high-risk
group that may benefit from adjuvant therapy.



Lung cancer remains the leading cause of cancer death in industrialized countries. Most patients
with non-small cell lung cancer (NSCLC) present with advanced disease, and despite recent
advances in multi-modality therapy, the overall 10-year survival rate remains a dismal 8-10%
(ref. 1). However, a significant minority of patients (~25-30%) with NSCLC have stage I disease
and receive surgical intervention alone. Although 35-50% of patients with stage I disease will
relapse within 5 years (refs. 2-4), it is not currently possible to identify specific high-risk
patients.

Adenocarcinoma is currently the predominant histological subtype of NSCLC (refs. 1,5,6). Although
morphological assessment of lung carcinomas can roughly stratify patients, there is a need to
identify patients at high risk for recurrent or metastatic disease. Preoperative variables that
affect survival of patients with NSCLC have been identified (refs. 7-10). Tumor size, vascular
invasion, poor differentiation, high tumor-proliferative index and several genetic alterations,
including K-ras (refs. 11,12) and p53 (refs. 10,13) mutations, have prognostic significance.
Multiple independently assessed genes or gene products have also been investigated to better
predict patient prognosis in lung cancer. Technologies that simultaneously analyze the expression
of thousands of genes can be used to correlate gene-expression patterns with numerous clinical
parameters, including patient outcome, to better predict tumor behavior in individual patients
(ref. 20). Analyses of lung cancers using array technologies have identified subgroups of tumors
that differ according to tumor type and histological subclasses and, to a lesser extent, survival
among adenocarcinoma patients (refs. 21,22). Here we correlated gene-expression profiles with
clinical outcome in a cohort of patients with lung adenocarcinoma and identified specific genes
that predict survival among patients with stage I disease. For further validation, we also show
that the risk index predicted survival in an independent cohort of stage I lung adenocarcinomas.

Hierarchical profile clustering yields three tumor subsets 
Using oligonucleotide arrays, we generated gene-expression profiles for 86 primary lung
adenocarcinomas, including 67 stage I and 19 stage III tumors, as well as 10 non-neoplastic lung
samples. Selected sample replicates showed high correlation coefficients and reliable
reproducibility. We determined transcript abundance using a custom algorithm, and the data set was
trimmed of genes expressed at extremely low levels; that is, genes were excluded if their 75th
percentile value was less than 100. Although potentially resulting in the loss of some
information, trimming in this manner decreased the possibility that the clustering algorithm would
be strongly influenced by genes with little or no expression in these samples.
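
The trimming rule described above can be illustrated with a minimal sketch. It assumes a genes x samples matrix of abundance values; the function name, the use of NumPy (with its default percentile interpolation) and the toy values are illustrative assumptions and do not reproduce the custom algorithm used in the study.

```python
import numpy as np


def trim_low_expression(expr, percentile=75.0, threshold=100.0):
    """Drop genes whose chosen percentile across samples is below threshold.

    expr is a genes x samples array; returns (trimmed array, boolean keep mask).
    """
    expr = np.asarray(expr, dtype=float)
    gene_percentile = np.percentile(expr, percentile, axis=1)
    keep = gene_percentile >= threshold  # genes below the threshold are excluded
    return expr[keep], keep


# Hypothetical toy data: 3 genes measured in 4 samples.
demo = np.array([
    [10.0, 20.0, 30.0, 40.0],     # 75th percentile 32.5 -> excluded
    [50.0, 150.0, 250.0, 350.0],  # 75th percentile 275.0 -> kept
    [90.0, 95.0, 99.0, 500.0],    # 75th percentile 199.25 -> kept
])
trimmed, keep = trim_low_expression(demo)
```

Here `keep` flags the genes that pass the filter and `trimmed` is the matrix that would be passed on to clustering.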
Hierarchical clustering with the resulting 4,966 genes yielded three clusters of tumors (Fig. 1).
All 10 non-neoplastic samples clustered tightly together within Cluster 1 (data not shown). We
examined the relationships between cluster and patient and tumor characteristics (Fig. 1 and
Supplementary Figure A online). There were associations between cluster and stage (P = 0.030) and
between cluster and differentiation (P = 0.01). Cluster 1 contained the greatest percentage
(42.8%) of well differentiated tumors, followed by Cluster 2 (27%) and Cluster 3 (4.7%). Cluster 3
contained the highest percentage of both poorly differentiated (47.6%) and stage III tumors
(42.8%), yet contained 3 (14.3%) moderately differentiated and 1 (5%) well differentiated stage I
tumor. Notably, 11 stage I tumors were present in Cluster 3, suggesting a common gene-expression
profile for this subset of stage I and stage III tumors.

For patients with stage I and stage III tumors, the average ages were 68.1 and 64.5 years and
the percentage of smokers was 88.9% and 89.5%, respectively. Marginally significant associations
between cluster and smoking history were observed (P = 0.06). A significant relationship between
histopathological classification and cluster was only discernible for bronchioloalveolar
adenocarcinomas (BAs), which were only present in Clusters 1 and 2 (P = 0.0055) and comprised
35.7% and 12.3% of tumors for Clusters 1 and 2, respectively.

We examined the heterogeneity in gene-expression profiles based on the trimmed data set among
normal lung samples and stage I and stage III adenocarcinomas by calculating correlation
coefficients between all pairs of samples. In contrast to normal lung samples, which displayed
highly similar gene-expression profiles (median correlation, 0.9), both stage I and stage III lung
tumors demonstrated much greater heterogeneity in their expression profiles, with lower
correlation coefficients (median values, 0.82 and 0.79, respectively).
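
As a hedged illustration of this heterogeneity measure, the sketch below computes the median correlation over all distinct pairs of samples within one group. Pearson correlation is assumed here because the text does not name the coefficient used; the function name and toy data are hypothetical.

```python
import numpy as np


def median_pairwise_correlation(expr):
    """Median Pearson correlation over all distinct pairs of samples.

    expr is a genes x samples array (columns are samples).
    """
    expr = np.asarray(expr, dtype=float)
    corr = np.corrcoef(expr, rowvar=False)  # samples x samples matrix
    pair_rows, pair_cols = np.triu_indices_from(corr, k=1)  # each pair once
    return float(np.median(corr[pair_rows, pair_cols]))


# Hypothetical toy data: samples 1 and 2 rise together, sample 3 falls.
demo = np.array([
    [1.0, 2.0, 4.0],
    [2.0, 4.0, 3.0],
    [3.0, 6.0, 2.0],
    [4.0, 8.0, 1.0],
])
demo_median = median_pairwise_correlation(demo)
```

Applied separately to the normal, stage I and stage III sample groups, a lower median indicates a more heterogeneous group.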

Northern-blot and immunohistochemistry analyses 
Of the 4,966 genes examined, 967 differed significantly between stage I and III adenocarcinomas,
a number in excess of that expected by chance alone (248 at α = 0.05). Three genes were
arbitrarily selected to verify the microarray expression data. The mRNA from 20 of the normal lung
and tumor samples was examined by northern-blot hybridization with probes for insulin-like growth
factor-binding protein 3 (IGFBP3), cystatin C
Fig. 1 Unsupervised classification analysis of lung adenocarcinomas. Three classes of tumors
identified by agglomerative hierarchical clustering of gene-expression profiles using the 4,966
expressed genes. Patient and histopathological information for each lung adenocarcinoma case by
cluster designation and methods for K-ras 12th/13th-codon mutational status and nuclear p53
protein accumulation are provided (Supplementary Figure A online). TN classification denotes
information regarding patient tumor size and nodal involvement. Associations between cluster
membership and patient or histopathological variables are indicated at significance level
(P ≤ 0.05).



and lactate dehydrogenase A (LDH-A) (Fig. 2a). Two gene probes not represented on the microarrays
were used as controls, including histone H4, a potential index of overall cell proliferation, and
28S ribosomal RNA, a control for sample loading and transfer. The relative amounts of IGFBP3,
cystatin C and LDH-A mRNA strongly correlated with microarray-based measurements (Fig. 2b). In
both assays, IGFBP3 and LDH-A mRNA levels increased from stage I to stage III adenocarcinomas and
were higher than those in normal lung. Cystatin C mRNA levels were more variable but relatively
greater in normal lung than tumors. These results suggest that the oligonucleotide microarrays
provided reliable measures of gene expression. The tumors showed slightly greater histone H4
expression than the normal lung, likely reflecting increased proliferation of tumor cells.
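
The count of 248 genes expected by chance, quoted earlier in this section, follows from multiplying the number of genes tested by the significance level. The short calculation below reproduces that arithmetic; the variable names are illustrative only.

```python
n_genes = 4966   # genes remaining after trimming
alpha = 0.05     # per-gene significance level
observed = 967   # genes reported to differ between stage I and stage III

expected_by_chance = n_genes * alpha          # about 248 genes
excess_ratio = observed / expected_by_chance  # observed count relative to chance
print(f"expected {expected_by_chance:.1f}, observed {observed}, ratio {excess_ratio:.1f}")
```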

Immunohistochemistry was performed for IGFBP3, cystatin C and HSP-70 to determine whether mRNA
overexpression was reflected by an increase of their corresponding proteins in tumors.
Fig. 2 Validation analyses of gene-expression profiling. a, Northern-blot analysis of selected
candidate genes for verification of data obtained from oligonucleotide arrays. The same sample RNA
for the 4 uninvolved lung, 8 stage I and 8 stage III tumors was used for the northern-blot and
oligonucleotide array analyses. b, Correlation analysis of quantitative data obtained from
oligonucleotide arrays and northern blots measured by integrated phosphorimager-based signals for
the IGFBP3 and LDH-A genes. The ratio of IGFBP3, cystatin C and LDH-A mRNA to 28S rRNA was
determined. The relative values for each gene from each sample are shown. n, non-neoplastic normal
lung; 1, stage I tumors; 3, stage III tumors. c, Immunohistochemical analysis of IGFBP-3, HSP-70
and cystatin C in lung and lung adenocarcinomas. Cytoplasmic IGFBP-3 immunoreactivity in a
neoplastic gland (tumor L22) with prominent apical staining (blue reactant staining, arrow, upper
left). Diffuse cytoplasmic HSP-70 immunoreactivity (tumor L27), yet stromal elements show no
reactivity (upper right). Normal lung parenchyma (lower left) shows cytoplasmic cystatin C
immunoreactivity in alveolar pneumocytes (arrow) and intra-alveolar macrophages, but tumor (L90)
shows diffuse cytoplasmic cystatin C immunoreactivity with prominent apical staining (lower
right). Magnification, ×?00.
Immunoreactivity for both IGFBP-3 and HSP-70 (Fig. 2c) was detected in the cytoplasm of the
adenocarcinomas, with little detectable reactivity in the stromal or inflammatory cells. Cystatin
C was detected in alveolar pneumocytes and intra-alveolar macrophages in non-neoplastic lung
parenchyma and also consistently in the cytoplasm of neoplastic cells.

Gene-expression profiles predict survival 

As expected, Kaplan-Meier survival curves (Fig. 3a) and log-rank tests indicated poorer survival
among stage III compared with stage I adenocarcinomas (P < 0.0001). Two statistical approaches
were used to determine whether gene-expression profiles could predict survival using the data set
of 4,966 genes. In one approach, equal numbers of randomly assigned stage I and stage III tumors
constituted training (n = 43) and testing (n = 43) sets. In the training set, the top 10, 20, 50
or 75 genes were used to create risk indices that were evaluated for their association with
survival using the 50th, 60th or 70th percentile cutoff points to categorize patients into high or
low groups. The results were similar across cutoff points but the 50-gene risk index had the best
overall association with survival in the training set.
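
For readers unfamiliar with the Kaplan-Meier estimate underlying the survival curves, the following is a minimal product-limit sketch. The function name and the toy follow-up data are hypothetical, and the code is not the software used to produce Fig. 3.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up time per patient; events: 1 = death observed, 0 = censored.
    Returns a list of (time, survival probability) at each distinct death time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Gather every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving  # deaths and censored patients leave the risk set
    return curve


# Hypothetical follow-up in months for five patients.
curve = kaplan_meier([5, 8, 12, 12, 20], [1, 0, 1, 1, 0])
```

In this toy example one of five patients dies at 5 months and two of the three patients still at risk die at 12 months, so the curve steps down twice.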




Fig. 3 Gene-expression profiles and patient survival. a, Relationship between tumor stage and
patient survival (stage I and stage III differ significantly, P < 0.0001). b, Relationship between
the survival in the 43 test samples and their risk assignments based on the 50-gene risk index
estimated in the 43 training samples. The high- and low-risk groups differ significantly
(P = 0.024). c, Relationship between patient survival and the risk assignments in test samples
(in b) conditional on tumor stage. The high- and low-risk stage I groups differ significantly
(P = 0.028), whereas stage III low- and high-risk groups did not (P = 0.634). d, Relationship
between survival in the test cases and their risk assignments based on the 86 'leave-one-out'
cross-validations of the 50-gene risk index. The high- and low-risk groups differ significantly
(P = 0.0006). e, Relationship between test case's risk assignment and survival (in d) conditional
on tumor stage. The high- and low-risk stage I lung adenocarcinoma groups differ significantly
from each other (P = 0.003), whereas low- and high-risk stage III tumors do not. f, Relationship
between tumor class identified by hierarchical clustering and patient survival. Survival for
patients in Cluster 3 differed relative to the tumors in Cluster 2 (P = 0.037) and approached
significance for Clusters 1 and 2 combined (P = 0.06). g, Analysis of the Michigan-based risk index
using top cross-validated survival genes identifies a low- and a high-risk group in an independent
cohort of 84 Massachusetts-based lung adenocarcinomas that are significantly different
(P = 0.003). h, Among the 62 stage I lung adenocarcinomas in the Massachusetts sample, the high-
and low-risk groups differed significantly (P = 0.006).

After conservatively choosing the 60th percentile cutoff point from the training set, we then
applied this risk index and cutoff point to the testing set. The risk index of the top 50 genes
correctly identified low- and high-risk individuals within the independent testing set
(P = 0.024) (Fig. 3b and Supplementary Methods online). Notably, 13 stage I tumors were included
in the high-risk subgroup. When this risk assignment was then conditionally examined for stage
progression (Fig. 3c), low- and high-risk groups among stage I tumors were found to differ
(P = 0.028) in their survival.
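
The training-testing procedure above can be sketched as follows. The sketch assumes, for illustration only, that the risk index is a coefficient-weighted sum of the selected genes' expression values; the exact definition is given in the Supplementary Methods, which are not reproduced here. Function names, coefficients and expression values are hypothetical.

```python
import numpy as np


def risk_index(expr, coefs):
    """Weighted sum of expression values (samples x genes) using per-gene coefficients."""
    return np.asarray(expr, dtype=float) @ np.asarray(coefs, dtype=float)


def fit_cutoff(train_index, percentile=60.0):
    """Cutoff point taken from the training-set distribution of the index."""
    return float(np.percentile(train_index, percentile))


def assign_risk(index, cutoff):
    """Label samples whose index exceeds the training cutoff as high risk."""
    return np.where(np.asarray(index) > cutoff, "high", "low")


# Hypothetical coefficients for two genes (positive = poorer outcome).
coefs = [0.002, -0.001]
train_expr = [[100, 100], [200, 100], [300, 100], [400, 100], [500, 100]]
test_expr = [[450, 100], [150, 100]]

cutoff = fit_cutoff(risk_index(train_expr, coefs))  # fixed using training data only
labels = assign_risk(risk_index(test_expr, coefs), cutoff)
```

The point of the design is that both the coefficients and the cutoff are fixed on the training set before the testing set is examined.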

Identification of a robust set of survival genes 
Although predictive of patient survival, a single training-testing set may not provide the most
robust set of genes due to random sampling issues. Therefore, a 'leave-one-out' cross-validation
approach was used to identify genes associated with survival from all 86 tumor samples. We first
developed a 50-gene risk index in each training set, and then applied the risk index to the test
case held out from the full set of tumors and assigned the held-out tumor to the high- or low-risk
groups (Fig. 3d). The high- and low-risk subgroups determined in the test cases differed
significantly in their overall survival (P = 0.0006). Among the larger group of stage I lung
adenocarcinomas, the low-risk (n = 46) and high-risk (n = 21) groups had markedly different
survival (P = 0.003) (Fig. 3e). Table 1 lists selected examples of the cumulative top 100 genes
derived from this cross-validation procedure (complete list in Supplementary Table A online).
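
The structure of the 'leave-one-out' procedure can be sketched as a loop in which gene selection, coefficients and cutoff are all re-derived without the held-out tumor. In this sketch the per-gene scoring function is pluggable; the demonstration uses a simple difference in mean expression between deceased and surviving patients as a stand-in, which is explicitly not the univariate Cox analysis used in the study. The weighted-sum index, all names and the toy data are assumptions.

```python
import numpy as np


def loo_risk_assignments(expr, score_genes, n_top=50, percentile=60.0):
    """Assign each sample to 'high' or 'low' risk using only the other samples.

    expr: samples x genes array.
    score_genes(train_idx) -> one coefficient per gene, computed on train_idx only.
    """
    expr = np.asarray(expr, dtype=float)
    n_samples = expr.shape[0]
    labels = []
    for held_out in range(n_samples):
        train = np.array([i for i in range(n_samples) if i != held_out])
        coefs = np.asarray(score_genes(train), dtype=float)
        top = np.argsort(-np.abs(coefs))[:n_top]  # strongest genes in this fold
        train_index = expr[train][:, top] @ coefs[top]
        cutoff = np.percentile(train_index, percentile)
        test_index = expr[held_out, top] @ coefs[top]
        labels.append("high" if test_index > cutoff else "low")
    return np.array(labels)


# Hypothetical toy data: gene 0 is high only in the two patients who died.
demo_expr = np.array([
    [10.0, 5.0, 7.0],
    [10.0, 5.0, 7.0],
    [0.0, 5.0, 7.0],
    [0.0, 5.0, 7.0],
    [0.0, 5.0, 7.0],
    [0.0, 5.0, 7.0],
])
died = np.array([1, 1, 0, 0, 0, 0])


def mean_difference_scores(train_idx):
    """Stand-in scorer (not Cox regression): mean expression, died minus alive."""
    x, d = demo_expr[train_idx], died[train_idx]
    return x[d == 1].mean(axis=0) - x[d == 0].mean(axis=0)


labels = loo_risk_assignments(demo_expr, mean_difference_scores, n_top=1)
```

Because each held-out tumor plays no part in selecting its own genes or cutoff, the pooled assignments can be compared with observed survival without reusing the same data for fitting and testing.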

It was also noted that many of the stage I patients in the high-risk subgroup (Fig. 3c) were
present in Cluster 3 (Fig. 1). Kaplan-Meier analysis (Fig. 3f) demonstrated a significantly worse
survival (P = 0.037) for patients in Cluster 3 relative to patients in Cluster 2, approaching
significance relative to Clusters 1 and 2 combined (P = 0.06). This further indicates the
important relationship between gene-expression profiles and patient survival, independent of
disease stage.

Consistent with previous analyses of lung adenocarcinomas (ref. 23), 40% of stage I and 57.8% of
stage III tumors had 12th or 13th codon K-ras gene mutations. Those patients with tumors
containing K-ras mutations showed a trend of poorer survival, but
Table 1 Selected examples of the top 100 genes from cross-validation

Gene name  P (normal  % change  P (stage I % change  Coefficient  Unigene comment
           vs tumor   in tumor  vs III     in stage
           t-test)              t-test)    III

Apoptosis-related
CASP4      0.36       -6%       0.02       57%       0.0022       Caspase 4, apoptosis-related cysteine protease
P63        9.73E-04   37%       0.03       43%       0.0010       Transmembrane protein (63 kD), endoplasmic reticulum/Golgi intermediate compartment

Cell adhesion and structure
KRT7       8.02E-08   126%      0.11       55%       0.0003       Keratin 7
LAMB1      0.14       -20%      0.01       60%       0.0027       Laminin, β1

Cell cycle and growth regulators
BMP2       0.54       -21%      0.27       47%       0.0044       Bone morphogenetic protein 2
CDC6       1.3?E-05   1070%     0.05       148%      0.0124       CDC6 (cell division cycle 6, Saccharomyces cerevisiae homolog)
S100P      2.10E-08   1572%     0.19       77%       0.0001       S100 calcium-binding protein P
SERPINE1   2.89E-03   72%       0.25       30%       0.0008       Serine (or cysteine) proteinase inhibitor, clade E (nexin)
STX1A      8.65E-08   54%       0.07       26%       0.0031       Syntaxin 1A (brain)

Cell signaling
ADM        0.05       39%       0.04       117%      0.0016       Adrenomedullin
AKAP12     8.53E-03   -47%      0.05       214%      0.0010       A kinase (PRKA) anchor protein (gravin) 12
ARHE       0.06       -39%      0.05       87%       0.0092       ras homolog gene family, member E
GRB7       2.02E-03   38%       0.63       15%       0.0030       Growth factor receptor-bound protein 7
VEGF       6.50E-08   174%      0.02       85%       0.0013       Vascular endothelial growth factor
WNT10B     0.05       31%       0.48       20%       0.0022       Wingless-type MMTV integration site family, member 10B

Chaperones
HSPA8      0.36       8%        9.01E-04   51%       0.0008       Heat-shock 70 kD protein 8

Receptors
ERBB2      ?          ?         ?          ?         ?            v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 2
FXYD3      0.10       111%      0.31       73%       0.0046       FXYD domain-containing ion transport regulator 3
SLC20A1    1.34E-03   58%       0.02       66%       0.0021       Solute carrier family 20 (phosphate transporter), member 1

Enzymes, cellular metabolism
CSTB       1.57E-04   50%       0.15       34%       0.0001       Cystatin B (stefin B)
CTSL       0.48       -10%      0.03       67%       0.0007       Cathepsin L
CYP24      3.16E-06   N/A       0.97       2%        0.0008       Cytochrome P450, subfamily XXIV (vitamin D 24-hydroxylase)
FUT3       1.07E-07   114%      0.97       -1%       0.0033       Fucosyltransferase 3 (galactoside 3(4)-L-fucosyltransferase, Lewis blood group included)
MLN64      0.20       32%       0.42       80%       0.0007       Steroidogenic acute regulatory protein related
PDE7A      0.12       33%       0.01       -35%      -0.0187
PLGL       0.04       -68%      0.35       -170%     -0.0011      Plasminogen like
SLC1A6     0.07       -32%      0.12       86%       0.0069       Solute carrier family 1 (high-affinity aspartate/glutamate transporter), member 6

Transcription and translation
COPEB      0.10       -33%      0.26       75%       0.0016       Core promoter element binding protein
CRK        0.10       32%       0.03       48%       0.0098       v-crk avian sarcoma virus CT10 oncogene homolog
RELA       0.26       -7%       0.01       20%       0.0034       v-rel avian reticuloendotheliosis viral oncogene homolog A

Unknown function
KIAA0005   2.2?E-04   40%       0.02       45%       0.0010       KIAA0005 gene product
MGB1       0.27       125%      0.33       459%      0.0018       Mammaglobin 1

Table 1 Selected examples of the cumulative top 100 genes identified using training-testing,
cross-validation of all 86 lung tumor samples. The percent change, as well as the direction, for
the average values of the 10 non-neoplastic lung to all tumors, and for the 67 stage I to the 19
stage III tumors are shown. A positive coefficient (β) value is indicative of a relationship of
gene expression to a poorer patient outcome. The genes are listed in potential functional
categories. Genes that were also present in the top 50 survival genes using the 43-tumor training
set (Fig. 3b) are indicated in bold type. Complete listing of the gene probe sets and annotated
gene and unigene identifiers can be found in the Supplementary Methods.
Fig. 4 Gene-expression patterns of top survival genes. a, Gene-expression patterns determined
using agglomerative hierarchical clustering of the 86 lung adenocarcinomas against the 100
survival-related genes (Table 1) identified by the training-testing, cross-validation analysis.
Substantially elevated (red) or decreased (green) expression of the genes is observed in
individual tumors. Some tumors (black arrow and expanded area) show extremely elevated expression
of specific genes. b, An outlier gene-expression pattern (>5 times the interquartile range among
all samples) is observed for the erbB2 and Reg1A genes (top left and right, respectively). The
S100P and crk genes (bottom left and right, respectively) show a graded pattern of expression
related to patient survival. ○, alive; ●, dead (also in c). c, The number of outliers per person
identified in the top 100 genes plotted by survival distribution.
this difference did not reach statistical significance among all patients (P = 0.25), between
patients within tumor clusters (P = 0.41) or when analyzed separately among stage I (P = 0.22) and
stage III (P = 0.53) patients. Nuclear accumulation of p53 was detected in 17.9% of stage I and in
22.2% of stage III tumors. No significant relationship was observed for p53 staining and patient
survival, cluster or tumor stage.

Confirmation using an independent set of adenocarcinomas 
The robustness of our 50-gene risk index in predicting survival in lung adenocarcinomas was
tested using oligonucleotide gene-expression data obtained from a completely independent
(Massachusetts-based) sample of 84 lung adenocarcinomas (62 stage I, 14 stage II and 8 stage III;
ref. 21, and dataset A at www.genome.wi.mit.edu/MPR/lung). To ensure equivalent power for testing
and comparability of samples, the criteria for including tumors in the analysis were 40% or
greater tumor cellularity, no mixed histology (that is, adenosquamous) and patient survival
information. To obtain comparative gene-expression measures between the two data sets, gene
sequences present on the U95A and HuGeneFL arrays were examined, and expression data for our top
50 cross-validation genes for all 84 Massachusetts samples were obtained and processed (see also
Supplementary Methods online). When we examined the risk assignment of these 84 samples,
employing the identical cutoff point used for the 86 Michigan-based lung samples, we observed
low- and high-risk groups that differed significantly in survival (Fig. 3g; P = 0.003). Notably,
among the 62 stage I tumors, high- and low-risk groups were observed that differed significantly
(P = 0.006) in their survival (Fig. 3h).

Survival genes had graded and outlier expression patterns 

A statistical and graphical analysis of the 100 survival-related genes (Table 1) clustered
against all 86 tumors revealed individual tumors with substantially elevated expression in both a
limited and larger number of genes (Fig. 4a). Among these genes, we observed two distinct patterns
of expression related to patient survival. One pattern, designated 'outlier', included genes
showing substantially elevated expression (greater than five times the interquartile range among
all samples), whereas the other pattern, designated 'graded', was characterized by continuously
distributed expression with patient survival (Fig. 4b). The erbB2 and Reg1A genes are examples of
outlier expression patterns and S100P and crk genes of graded patterns. The number of outliers
per person in the top 100 genes was identified and plotted according to survival times and events
(Fig. 4c). Both stage I and stage III lung adenocarcinomas showed outlier gene patterns and 10
tumors contained 3 or more outlier genes.
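
A hedged sketch of how outlier expression might be flagged and counted per tumor is given below. The text states only that outliers exceed five times the interquartile range among all samples; the reference point used here (the gene's median plus five interquartile ranges) is an assumption, as are the function names and toy values.

```python
import numpy as np


def outlier_mask(expr, k=5.0):
    """Flag values above their gene's median + k * IQR (genes x samples array)."""
    expr = np.asarray(expr, dtype=float)
    q1, median, q3 = np.percentile(expr, [25, 50, 75], axis=1)
    threshold = median + k * (q3 - q1)  # assumed reference point: the median
    return expr > threshold[:, None]


def outliers_per_tumor(expr, k=5.0):
    """Number of outlier genes in each sample (column)."""
    return outlier_mask(expr, k).sum(axis=0)


# Hypothetical toy data: gene 0 is extreme in the last tumor only.
demo = np.array([
    [10.0, 11.0, 12.0, 13.0, 14.0, 200.0],
    [5.0, 5.0, 6.0, 6.0, 7.0, 7.0],
])
counts = outliers_per_tumor(demo)
```

The per-tumor counts correspond to the quantity plotted against survival in Fig. 4c, under the assumed threshold definition.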

Because gene amplification may result in increased gene expression, the nine genes with outlier
expression patterns (erbB2, SLC1A6, Wnt 1, MGB1, Reg1A, AKAP12, PACE, CYP24, KYNU) and one gene
with a graded expression pattern (KRT18) were examined using quantitative genomic PCR to evaluate
genomic copy number (Fig. 5a). Gene amplification of erbB2 (17q12) was detected in tumor L94,
which had the highest erbB2 mRNA expression (Fig. 4a). Gene amplification was not detected for any
of the other seven tested genes in tumor L94, as well as in other tumors. The two genes most
frequently demonstrating the outlier pattern in these lung adenocarcinomas were KYNU and CYP24,
which were present in 10 and 9 tumors, respectively. CYP24 has been described as a gene amplified
and overexpressed in breast cancer (ref. 25), and these results indicate elevated expression in
lung adenocarcinoma.

To determine whether the graded or outlier gene-expression patterns also occur at the
protein-expression level, 16 of the 100
Fig. 5 Gene amplification and protein expression of survival-related genes. a, Analysis of
potential gene amplification for 9 genes showing outlier expression patterns in the lung tumors
(erbB2, SLC1A6, Wnt 1, MGB1, Reg1A, AKAP12, PACE, CYP24 and KYNU) and examined using quantitative
genomic PCR. A gene showing graded expression pattern (KRT18), and one gene (PACE4) with a similar
chromosome location as PACE, were used as controls. Only erbB2 and Reg1A are shown. An esophageal
adenocarcinoma with known high-level genomic amplification of erbB2 was used as a positive control
and normal esophagus DNA was used as a negative control (Ctl). PCR fragment sizes were 343 bp for
GAPDH, 166 bp for erbB2 and 176 bp for Reg1A. DNA is from normal lung (N) and tumor (T) from each
patient (for example, L37). b, Immunohistochemical analysis of survival-related genes with lung
adenocarcinoma microarrays using the tumors from this study. The transmembrane erbB2 protein (top
left) expression is substantially increased in tumor L94 containing the amplified erbB2 gene
(Fig. 4a and b). Expression of VEGF (top right) and S100P (bottom left) was located within the
neoplastic cells and the pattern of immunoreactivity was consistent with the graded expression
pattern demonstrated by their mRNA profiles. The oncogene crk (bottom right) was abundantly
expressed in neoplastic lung cells. Magnification, ×400 (erbB2); ×200 (VEGF, S100P and crk).
top survival genes (Table 1) for which specific antibodies were available were chosen for
immunohistochemical analysis using lung-tumor arrays from this study (Fig. 5b). Expression of
membrane erbB2 protein was substantially increased in the erbB2-amplified tumor L94 and very low
levels of expression were present in other tumors, consistent with mRNA-expression measurements
(Fig. 4a and b). CDC6 protein expression was also substantially higher in tumor L94, consistent
with mRNA levels (data not shown). Expression of vascular endothelial growth factor (VEGF) and
S100P (Fig. 5b), as well as cytokeratin 18 (KRT18), cytokeratin 7 (KRT7) and fas-associated death
domain (FADD) protein (data not shown), was located within the lung tumor cells and consistent
with the graded expression pattern of the mRNA profiles. The oncogene crk showed both a graded
mRNA and a graded protein-expression pattern with survival, and was abundantly expressed in the
tumor cells (Fig. 5b). These results indicate that many survival-associated genes are expressed
at the protein level and demonstrate similar mRNA and protein-expression patterns.

Discussion 

We used several approaches for the analysis of gene-expression data related to
clinicopathological variables and patient survival. One approach, hierarchical clustering, was
used to examine similarities among lung adenocarcinomas in their patterns of gene expression.
Previous studies of lung tumors (refs. 21,22) have also used this method to describe subclasses of
lung tumors. Here, we found three clusters that showed significant differences with respect to
tumor stage and tumor differentiation. This suggests, as expected, that tumors with similar
histological features of differentiation demonstrate similarities in gene expression. This feature
also partly underlies the observed statistical association of tumor stage and cluster, as many of
the higher-stage tumors, often poorly differentiated and previously associated with a reduced
survival, were located in Cluster 3. Although this cluster contained the highest percentage of
stage III tumors, it also contained a nearly equal mixture of stage I and stage III tumors and not
all tumors were poorly differentiated. This indicates that a subset of stage I lung
adenocarcinomas share gene-expression profiles with higher-stage tumors. Notably, 10 of the 11
stage I tumors found in Cluster 3 were the high-risk stage I tumors identified using the risk
index in the 'leave-one-out' cross-validation.

In contrast to previous analyses of lung adenocarcinomas (refs. 21,22), we validated the
expression data from the arrays. The strong correlation of northern-blot analysis and
oligonucleotide-array data for gene expression in the same samples (Fig. 2b) indicates that these
studies provide robust gene-expression estimates. Immunohistochemistry using the same tumor
samples in tissue arrays demonstrates protein expression within the lung tumor cells. Together,
these studies indicate that many of the genes identified using gene-expression profiles are likely
relevant to lung adenocarcinoma. For example, IGFBP3 gene expression is increased in lung
adenocarcinomas (Fig. 2a). IGFBP3 protein modulates the autocrine or paracrine effects of
insulin-like growth factors, elevated IGFBP3 expression is observed in colon cancer (ref. 26),
and increased serum IGFBP3 is associated with progression in breast cancer (ref. 27). Heat-shock
protein 70 (HSP-70) is increased in lung adenocarcinomas of smokers (ref. 28) and is associated
with increased metastatic potential in breast cancer (ref. 29). Increased serum lactate
dehydrogenase is correlated with tumor stage and tumor burden (ref. 30), and cystatin C, a
cysteine protease inhibitor expressed in human lung cancers (ref. 31), is prognostic in some
cancers (ref. 32). The decreased expression of this protease inhibitor may affect the invasive
properties of the tumor cell.

The cross-validation analytical strategy we used is particularly informative for these types of
gene-expression analyses for disease outcome (refs. 33,34), and identification of cross-validated
genes with a larger tumor cohort may help refine this risk index for use in a clinical setting.
The gene-expression data also provide opportunities to observe overarching patterns that advance
our understanding of associations between genes and disease. For example, the top 100 survival
genes include those involved in signaling, cell cycle and growth, transcription, translation and
metabolism. Expression of many of these genes is likely a function of increased proliferation and
metabolism in the more aggressive tumors. Some genes, such as erbB2 and Reg1A (Fig. 4a and b),
were highly overexpressed in a few patients having poor survival. In one tumor, the erbB2 gene was
amplified (Fig. 5a), demonstrating that genomic changes may underlie the overexpression of a
subset of these outlier genes. Immunohistochemistry confirmed protein overexpression in this
patient's tumor (Fig. 5b). Notably, seven of the eight outlier genes were not amplified,
indicating that other mechanisms underlie the increased mRNA expression of these survival-related
genes.

Most genes showed a graded relationship between expression
and patient survival. Genes such as that encoding VEGF, known
to be strongly associated with survival in lung cancer 34 - 3 *, were
identified as related to patient survival in our study. VEGF
demonstrated a graded expression pattern, as did the S100P and
crk oncogene (Fig. 5b). S100P is a calcium-regulated protein not
previously reported in lung cancer. The crk gene, the cellular
homolog of the v-crk oncogene, is a member of a family of adaptor
proteins involved in signal transduction and interacts directly
with c-jun N-terminal kinase 1 (JNK1) 37 . Although crk has not
been shown to have a role in lung cancer, its role in the MAP-kinase
pathway, which leads to activation of matrix metalloproteinase
secretion and cell invasion 38 , indicates potential
involvement in the tumor cell invasion or metastasis of
some lung adenocarcinomas. Among the many genes identified
in this study, like crk, that may be causally involved in lung cancer
progression (Table 1), some were related to survival in many
patients, and others in only smaller subsets of patients. This result
is consistent with the complex molecular architecture of tumors
in general, the heterogeneity of lung adenocarcinomas in
particular and the multiple mechanisms underlying tumor-cell
survival, invasion and metastasis".

Our results demonstrate that a gene-expression risk profile,
based on the genes most associated with patient survival, can
distinguish stage I lung adenocarcinomas and differentiate prognoses.
The particular genes that define the clusters, or are associated
with survival, likely reflect the characteristics of the
particular tumors included in the analysis. Current therapy for
patients with stage I disease usually consists of surgical resection
without adjuvant treatment 23 . Clearly, the identification of a
high-risk group among patients with stage I disease would lead
to consideration of additional therapeutic intervention for this
group, possibly leading to improved survival of these patients.

Methods 

Patient population. Sequential patients seen at the University of Michigan
Hospital between May 1994 and July 2000 for stage I or stage III lung
adenocarcinoma were evaluated for this study. Consent was received and the
project was approved by the local Institutional Review Board. Primary tumors
and adjacent non-neoplastic lung tissue were obtained at the time of
surgery. Peripheral portions of resected lung carcinomas were sectioned,
evaluated by a study pathologist and compared with routine H&E sections
of the same tumors, and utilized for mRNA isolation. Regions chosen for
analysis contained a tumor cellularity greater than 70%, no mixed histology,
potential metastatic origin, extensive lymphocytic infiltration or fibrosis.
Tumors were histopathologically divided into two categories based on
their growth pattern: bronchial-derived, if they exhibited invasive features
with architectural destruction, and bronchioloalveolar, if they exhibited
preservation of the lung architecture. All stage I patients received only
surgical resection with intra-thoracic nodal sampling and no other treatments;
stage III patients received surgical resection plus chemotherapy and
radiotherapy.

Gene-expression profiling and K-ras mutation analysis. RNA isolation,
cRNA synthesis and gene-expression profiling were performed as
described 24 . Details of gene annotation and K-ras mutation analysis are
provided in supplementary information.

Northern-blot analysis. Total cellular RNA (10 µg) was separated in 1.2%
agarose-formaldehyde gels and vacuum-transferred to GeneScreen Plus
(NEN Life Science Products, Boston, Massachusetts). Hybridization conditions
and probe labeling were as described 40 . Individual sequence-validated
cDNA image clones for human IGFBP3 (clone 1407750), LDH-A (clone
2420241), cystatin C (CST3; clone 949938) were from Research Genetics
(Huntsville, Alabama). The human histone H4 cDNA and the 28S ribosomal
RNA 26-mer oligonucleotide probe were prepared and labeled as
described 40 .

Gene-amplification analysis. 11 genes were selected for the analysis of
genomic alterations. Primers were designed using PrimerSelect 4.05 Windows
32 software (DNASTAR, Madison, Wisconsin), avoiding pseudogenes or
potential homologous regions. Forward and reverse primers for the genes are
provided (Supplementary Methods online). Quantitative genomic-PCR was
then applied and analyzed as described 41 .

Immunohistochemical staining. The H&E-stained slides of all primary
lung tumors were used to identify the most representative regions of each
tumor and a tissue microarray (TMA) block was constructed as described 41 .
Immunohistochemistry (IHC) was performed using both routine sections and
sections from the TMA block as described". Detailed methods and the
concentrations used for all antibodies are provided in the Supplementary
Methods.

Statistical methods. t-tests were used to identify differences in mean
gene-expression levels between comparison groups. Agglomerative hierarchical
clustering 43 was applied using the average linkage method to investigate
whether there was evidence for natural groupings of tumor samples based
on correlations between gene-expression profiles. To investigate the
robustness of the clustering inference, gene-expression values were
perturbed by adding random Gaussian error of magnitude obtained from a
duplicate sample to each data point and then reclustered to determine
concordance in the tumor's class membership. Pearson, χ² and Fisher's exact
tests were used to assess whether cluster membership was associated with
physical and genetic characteristics of the tumors.

To determine whether gene-expression profiles were associated with
variability in survival times, 2 separate but complementary approaches
were used. In the first approach, the 86 tumors were randomly assigned to
equivalent training and testing sets consisting of equal numbers of stage I
and III tumors in order to validate a novel risk-index function that captured
the effect of many genes at once. In the second approach, cross-validation 4 *
was used to more robustly identify the genes associated with survival.
Briefly, a 'leave-one-out' cross-validation procedure was used in which 85 of
the 86 tumors (the training set) were used to identify genes that were
univariately associated with survival. The risk index was defined as a linear
combination of the gene-expression values for the top genes identified by
univariate Cox proportional-hazard regression modeling 41 , weighted by their
estimated regression coefficients. Kaplan-Meier survival plots and log-rank
tests were then used to assess whether the risk-index assignment to high/low
categories was validated in the test set. A more detailed description is
provided (Supplementary Methods online).
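
The risk-index construction described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the procedure used in the study: the gene names, coefficients and expression values are hypothetical, and using the training-set median as the high/low cutoff is an assumption made here (the text defers such details to the Supplementary Methods).

```python
from statistics import median


def risk_index(expression, coefficients):
    """Linear combination of expression values weighted by Cox coefficients."""
    return sum(beta * expression[gene] for gene, beta in coefficients.items())


# Hypothetical univariate Cox regression coefficients for the top genes.
coefficients = {"geneA": 0.8, "geneB": -0.5}

# Hypothetical expression values for three training tumors.
training = {
    "T1": {"geneA": 1.0, "geneB": 2.0},
    "T2": {"geneA": 3.0, "geneB": 1.0},
    "T3": {"geneA": 2.0, "geneB": 4.0},
}

train_index = {tumor: risk_index(expr, coefficients) for tumor, expr in training.items()}

# Assumed cutoff (not from the source): median risk index of the training set.
cutoff = median(train_index.values())

# Classify a held-out tumor with the cutoff learned from the training set only.
held_out = {"geneA": 4.0, "geneB": 0.0}
held_out_group = "high" if risk_index(held_out, coefficients) > cutoff else "low"
```

In a leave-one-out setting, the coefficients and cutoff would be re-estimated for each held-out tumor, so that the tumor being classified never contributes to its own risk assignment.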



Note: Supplementary information is available on the Nature Medicine website.

ABSTRACT

Motivation: Protein abundance is related to mRNA 
expression through many different cellular processes. 
Up to now, there have been conflicting results on how 
correlated the levels of these two quantities are. Given that 
expression and abundance data are significantly more 
complex and noisy than the underlying genomic sequence 
information, it is reasonable to simplify and average them
in terms of broad proteomic categories and features (e.g.
functions or secondary structures), for understanding their
relationship. Furthermore, it will be essential to integrate, 
within a common framework, the results of many varied 
experiments by different investigators. This will allow one 
to survey the characteristics of highly expressed genes 
and proteins. 

Results: To this end, we outline a formalism for merging
and scaling many different gene expression and protein
abundance data sets into a comprehensive reference
set, and we develop an approach for analyzing this in
terms of broad categories, such as composition, function,
structure and localization. As the various experiments are
not always done using the same set of genes, sampling
bias becomes a central issue, and our formalism is
designed to explicitly show this and correct for it. We apply
our formalism to the currently available gene expression
and protein abundance data for yeast. Overall, we found
substantial agreement between gene expression and
protein abundance, in terms of the enrichment of structural
and functional categories. This agreement, which was
considerably greater than the simple correlation between
these quantities for individual genes, reflects the way
broad categories collect many individual measurements
into simple, robust averages. In particular, we found
that in comparison to the population of genes in the
yeast genome, the cellular populations of transcripts and
proteins (weighted by their respective abundances, the
transcriptome and what we dub the translatome) were both
enriched in: (i) the small amino acids Val, Gly, and Ala;
(ii) low molecular weight proteins; (iii) helices and sheets
relative to coils; (iv) cytoplasmic proteins relative to nuclear
ones; and (v) proteins involved in 'protein synthesis,' 'cell
structure,' and 'energy production.'
Supplementary information: http://genecensus.org/ 
expression/translatome 
Contact: mark.gerstein@yale.edu 

INTRODUCTION 

High-throughput experiments measuring mRNA
(Schena et al., 1995; Eisen and Brown, 1999; Ferea
and Brown, 1999; Lipshutz et al., 1999) and protein
expression (Anderson and Seilhamer, 1997; Futcher et al.,
1999; Gygi et al., 1999a; Ross-Macdonald et al., 1999;
Lopez, 2000; MacBeath and Schreiber, 2000; Nelson
et al., 2000; Zhu et al., 2000) are currently the single
richest source of genomic information. However, how to
best interpret this data is still an open question (Bassett
et al., 1996; Wittes and Friedman, 1999; Zhang, 1999;
Gerstein and Jansen, 2000; Searls, 2000; Sherlock, 2000;
Claverie, 1999; Einarson and Golemis, 2000; Epstein and
Butow, 2000; Shapiro and Harris, 2000). Understanding
how protein abundance is related to mRNA transcript
levels is essential for interpreting gene expression, protein
interactions, structures and functions in a cellular system
(Hatzimanikatis et al., 1999). Moreover, as protein
concentration is the more relevant variable with respect
to enzyme activity, it connects genomics to the physical
chemistry of the cell (Kidd et al., 2001). Protein abundance
may also be invaluable for diagnostics and for
determining drug targets (Corthals et al., 2000).

Previously, we surveyed the population of protein
features (such as folds, amino acid composition, and
functions) in yeast and other recently sequenced
genomes (Gerstein, 1997, 1998a,b; Gerstein and Hegyi,
1998; Hegyi and Gerstein, 1999; Das and Gerstein,
2000; Lin and Gerstein, 2000), and we extended this
concept to compare the population of features in the
yeast transcriptome to that in the genome (Drawid et
al., 2000; Jansen and Gerstein, 2000). Others have also
done related work (Frishman and Mewes, 1997; Tatusov
et al., 1997; Jones, 1998; Wallin and von Heijne, 1998;
Frishman and Mewes, 1999; Wolf et al., 1999). Here, we
present a new methodology to compare the features of the
mRNA expression population with the protein abundance
population.

Precise terminology is essential for this comparison.
Unfortunately, 'proteome' is used inconsistently. Proteome
can logically be used to describe all the distinct
proteins in the genome (Qi et al., 1996; Cavalcoli et al.,
1997; Fey et al., 1997; Garrels et al., 1997; Gaasterland,
1999; Jones, 1999; Sali, 1999; Tekaia et al., 1999;
Bairoch, 2000; Cambillau and Claverie, 2000; Doolittle,
2000; Pandey and Mann, 2000; Rubin et al., 2000) and,
in this context, it is equivalent to what others may refer
to as the coding part of the genome. However, in papers
on two-dimensional (2D) electrophoresis, it is often used
to describe the sum total of proteins in a cell, taking
into account the different levels of protein abundance
(Shevchenko et al., 1996; Gygi et al., 2000a; Lopez,
2000; Washburn and Yates, 2000). In an effort to be clear,
we propose the term 'translatome' for this second usage
of proteome.

With this definition, we are able to refer compactly to 
three different cellular populations. These are illustrated 
in Figure 1. 

(i) We use the term genome when we refer to the
population of open reading frames, where each ORF
counts once.

(ii) We use the term transcriptome when we refer to
the population of mRNA transcripts. This term was
originally coined by Velculescu et al. (1997). Note
that each ORF may give rise to different numbers
of transcripts. Consequently, the transcriptome is
essentially the same as the genome but with each
ORF weighted by its expression level.

(iii) The next level is the cellular population of proteins.
As each protein represents a translated transcript,
we make an analogy with the term transcriptome
and use the term translatome as described above
to describe this third population. Thus, the translatome
is a subset of the genome where each
ORF is weighted by its associated level of protein
abundance.



Note that one could also, less compactly, call the translatome
a 'weighted proteome.' However, doing so assumes
one of the two aforementioned definitions of proteome. To
avoid ambiguity, we studiously avoid the use of proteome
altogether in the paper.

Differences between the translatome and the transcrip- 
tome exist given that transcripts from different genes 
can give rise to different numbers of proteins, due to 
different rates of translation and protein degradation. 
Post-transcriptional modifications further affect the 
translatome. 

In our analysis of the transcriptome and translatome, we 
focus on global protein features rather than the comparison
of individual genes. Previous analyses have shown that
differences between mRNA expression and protein abundance
levels can be quite dramatic for individual genes.
This may either be due to the noise in the data or to funda- 
mental biological processes. However, our analyses show 
that the variation between transcriptome and translatome 
is much smaller for global properties that are computed by 
averaging over the properties of many individual genes. 

METHODS 

Data sources used

For our analysis we culled many divergent data sets, 
representing protein abundance and mRNA expression 
experiments and also other sources of genome annotation. 
These are all summarized in Table 1.

Biases in the data 

The databases that annotate the specific genes may
not always be accurate (Ishii et al., 2000). Gene Chip
experiments suffer from cross-hybridization
and the saturation of probes. SAGE data degrades for
lowly expressed mRNAs. 2D gels are unable to resolve
membrane proteins (approximately 30% of the genome)
and basic proteins (Gerstein, 1998c; Krogh et al., 2001).
In addition, the procedures for identification and
quantification of the protein spots are subject to uncertainties
(Haynes and Yates, 2000). Human biases include the
lack of low abundance proteins (Fey and Larsen, 2001;
Gygi et al., 2000b; Harry et al., 2000) and the differences
between laboratories in sample preparation. Our reference
expression data set attempts to resolve these problems.

Data set scaling 

A reference set for mRNA expression. With many differ- 
ent mRNA expression data sets available, it is worthwhile 
to integrate them into a single unified reference set, with 
the intention of reducing the noise and errors contained in 
the individual data sets and to obtain a unified estimate of 
the normal expression state in a cell. 
We adopt an iterative scaling and merging formalism,
which we summarize below. We present a more detailed
review of the methods on our web site.

Fig. 1. Schematic overview of the analysis. On the left side we outline the terms we use to describe the process of gene expression. The
coding section of the genome is transcribed into a population of mRNA transcripts called the 'transcriptome.' The transcripts in turn are
translated to a population of proteins; we use the term 'translatome' for this protein population rather than the alternative 'proteome' because
the latter term may be confounded with the protein complement of the genome (which is not necessarily associated with a quantitative
abundance level).

The matrix in the middle schematically shows an analysis of the three stages of expression. In general, we define a protein 'population' as
a set of genes associated with a corresponding number of expression or abundance levels ('weights'). In the matrix each row represents a
weight and each column a gene set. In particular, we differentiate between the mRNA reference expression set ($G_{mRNA} = G_{Gen}$), which
essentially covers the complete genome, and the reference protein abundance set ($G_{Prot}$), which contains the proteins in data sets 2-DE #1 and
2-DE #2 (see Table 1), because the protein abundance set is a significantly smaller subset of the genome. By definition, this subset contains
only proteins that can be identified by 2-D gel electrophoresis and is therefore biased in this sense. The enrichment figures throughout this
paper, through a comparison of the right and left sides of this figure, show the results of the experimental biases of 2D gels on the data set.
Each pie chart represents a composition of a particular protein feature F (for instance, an amino acid composition) in a population (represented
by the symbol $\mu$). We can further look at the 'enrichment' of this feature in one population relative to another (represented by the symbol $\Delta$;
see Section 'Methods' for an explanation of the formalism).

We start with the values of one Gene Chip data set $U_i$,
where $i$ is used throughout as a subscript to denote gene
number. We then transform the values of the next Gene
Chip data set $X_i$ to $Y_i$ with the following non-linear
regression:

$$\min_{A,B} \sum_i (Y_i - U_i)^2 \quad \text{with} \quad Y_i = A X_i^{B},$$

where $A$ and $B$ are the parameters of the regression. Note that two Gene
Chip sets may not be defined for the same set of genes,
so we have to perform the fit only over the genes common
to both sets. The motivation for scaling is that the
dynamic range of observed expression levels varies somewhat
between different data sets, although cell types and
growth conditions are very similar. Reasons for disparity
may include different calibration procedures for relating
fluorescence intensity to a cellular concentration (measured
in copies of transcripts per cell) or different protocols
for harvesting and reverse-transcribing the cellular
mRNA.
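
The scaling step can be illustrated with a short Python sketch. Note the assumption: rather than minimizing the squared residuals directly, as described above, this sketch fits the power law by ordinary least squares on log-transformed values, which is a simpler substitute and will generally give somewhat different parameters on noisy data. The function names and the toy values are hypothetical.

```python
import math


def fit_power_law(x_by_gene, u_by_gene):
    """Estimate A and B in Y = A * X**B over the genes common to both sets.

    Assumption: the fit is a linear least-squares fit in log-log space,
    not the direct non-linear least-squares fit described in the text.
    """
    common = [g for g in x_by_gene
              if g in u_by_gene and x_by_gene[g] > 0 and u_by_gene[g] > 0]
    if len(common) < 2:
        raise ValueError("need at least two common genes with positive values")
    log_x = [math.log(x_by_gene[g]) for g in common]
    log_u = [math.log(u_by_gene[g]) for g in common]
    n = len(common)
    mean_x = sum(log_x) / n
    mean_u = sum(log_u) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    if sxx == 0:
        raise ValueError("common genes must not all share one expression value")
    sxu = sum((a - mean_x) * (b - mean_u) for a, b in zip(log_x, log_u))
    b_param = sxu / sxx
    a_param = math.exp(mean_u - b_param * mean_x)
    return a_param, b_param


def rescale(x_by_gene, a_param, b_param):
    """Map every value of data set X onto the scale of the reference set U."""
    return {g: a_param * x ** b_param for g, x in x_by_gene.items()}


# Toy data: U follows 2 * X**1.5 exactly on the four common genes.
x_set = {"g1": 1.0, "g2": 2.0, "g3": 4.0, "g4": 8.0, "only_in_x": 3.0}
u_set = {"g1": 2.0, "g2": 2.0 * 2.0 ** 1.5, "g3": 16.0,
         "g4": 2.0 * 8.0 ** 1.5, "only_in_u": 7.0}

A, B = fit_power_law(x_set, u_set)
y_set = rescale(x_set, A, B)
```

Genes present in only one data set do not enter the fit, but every gene of the rescaled set is still transformed, so it can take part in the subsequent merging step.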

We then merge and average the data to create a new
reference set $V$ as follows:

If $U_i$ and $Y_i$ are both defined for gene $i$ and $\frac{|Y_i - U_i|}{Y_i + U_i} < \alpha$,
Then $V_i = \frac{1}{2}(Y_i + U_i)$
Else if only $Y_i$ exists, $V_i = Y_i$
Else $V_i = U_i$.

As presented above, where only one data set has a value
for the corresponding ORF, we incorporated that value
and did not exclude it. When both data sets have values
for an ORF, we averaged the values if they were within
$\alpha$ of each other; otherwise, we just stayed with the
original chip data set $U_i$. We used $\alpha = 15\%$ in order to
prevent outliers from skewing the result. This 15% value is
a reasonable threshold for excluding outliers though other
values (e.g. 10 or 20%) would give similar results (data
not shown). Other data sets are subsequently included in
the same procedure, continuing the iteration from the new
expression values $V_i$. The initial iteration starts with the
Young Expression Set, as $U_i$, since we have the highest
confidence in its accuracy.
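
A minimal Python sketch of this merging rule follows. It assumes that "within α of each other" means that the absolute difference divided by the sum of the two values is below α; the function name and the toy values are hypothetical.

```python
def merge_reference(u, y, alpha=0.15):
    """Merge reference set u with rescaled set y into a new reference set.

    Assumption: two values agree when |y - u| / (y + u) < alpha.
    """
    merged = {}
    for gene in set(u) | set(y):
        if gene in u and gene in y:
            u_i, y_i = u[gene], y[gene]
            total = u_i + y_i
            if total > 0 and abs(y_i - u_i) / total < alpha:
                merged[gene] = 0.5 * (u_i + y_i)   # values agree: average them
            else:
                merged[gene] = u_i                 # disagree: keep the reference
        elif gene in y:
            merged[gene] = y[gene]                 # only the new set has a value
        else:
            merged[gene] = u[gene]                 # only the reference has a value
    return merged


u_ref = {"a": 10.0, "b": 10.0, "c": 5.0}
y_new = {"a": 11.0, "b": 30.0, "d": 7.0}
v_ref = merge_reference(u_ref, y_new)
```

Further data sets can be folded in by calling the function again with the merged result as the new reference, which mirrors the iteration described above.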

The SAGE data (Velculescu et al., 1997) was not
included in the above procedure since it is of a fundamentally
different nature. An advantage of the SAGE
technology over Gene Chips is that there is no possible
signal saturation for high expression levels, as is possible
for chips (Futcher et al., 1999). Conversely, SAGE values
are less reliable for lowly expressed genes since there
is a chance that one might not sequence a SAGE tag
corresponding to such a gene altogether. Therefore, if
after the last iteration, the average Gene Chip expression
level $V_i$ was both above a certain threshold and below
the SAGE expression level $S_i$ for the same gene, it was
replaced with the SAGE value; otherwise the average Gene
Chip value was kept. This gave us our final expression
set $w_{mRNA}$. Our treatment of the SAGE data is modeled
after that in Futcher et al. (1999), and like them, we used
a threshold of 16.

This incorporation of the SAGE data into the reference
data set ensures that the highly expressed outliers are as
accurate as possible.
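
The substitution rule for the SAGE values is simple enough to state as code. The sketch below is an illustration only; the function name and the toy values are hypothetical, and the threshold of 16 is taken from the text.

```python
def apply_sage(chip, sage, threshold=16.0):
    """Replace a Gene Chip value by the SAGE value when the chip value is
    above the threshold and below the SAGE value for the same gene."""
    final = dict(chip)
    for gene, v_i in chip.items():
        s_i = sage.get(gene)
        if s_i is not None and threshold < v_i < s_i:
            final[gene] = s_i
    return final


chip_values = {"a": 20.0, "b": 20.0, "c": 5.0, "d": 30.0}
sage_values = {"a": 50.0, "b": 10.0, "c": 40.0}
w_mrna = apply_sage(chip_values, sage_values)
```

Only gene "a" is replaced here: gene "b" has a SAGE value below the chip value, gene "c" is below the threshold, and gene "d" has no SAGE measurement.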

Rather than plain arithmetic averaging, this overall scaling
procedure with the $\alpha$ cutoff avoids 'artificial averages'
that combine very different values for a particular gene.
Some expression values might be statistical outliers. In
addition, it may be possible that the expression levels of
a variety of genes can only be within mutually exclusive
ranges or modes, such as when two alternative pathways
are switched on or off. Simply averaging these would give
values that are less representative of the particular mode
values. This situation is analogous to that in averaging
together an ensemble of protein structures (i.e. from NMR
structure determination). Each structure could be
stereochemically correct, with all side-chain atoms in predefined
rotamer configurations. However, an average of all structures
could yield one that is stereochemically incorrect if
this involved averaging over particular side-chains in different
rotameric states.

With regard to our regression analysis, we have investigated
both non-linear and linear fits but found a non-linear
procedure to be more advantageous. The non-linear relationship
between different expression data sets perhaps
reflects saturation in one or more of the Gene Chips, not
an uncommon phenomenon. This non-linearity is immediately
evident on scatter plots of two data sets against
one another (see website). Accordingly, the non-linear
fit produces a smaller residual than the linear fit: 9S297
(non-linear) versus 122 182 (linear) for the scaling of the
Church data set and 5982S (non-linear) versus 67 462
(linear) for the Samson data set.



A reference set for protein abundance. We followed a
similar procedure to calculate a reference protein abundance
set from the two gel electrophoresis data sets. We
first scaled the two data sets against the mRNA expression
reference data set, getting regression parameters $C_j$
and $D_j$:

$$\min_{C_j, D_j} \sum_i \left(P_{i,j} - C_j \, w_{mRNA,i}^{D_j}\right)^2,$$

where the subscript $j$ indicates the data set 2-DE #1 or
2-DE #2 respectively; $P_{i,j}$ is the protein abundance value
in data set $j$, and $w_{mRNA,i}$ the corresponding reference
expression value, and $C_j$ and $D_j$ are the parameters of
the non-linear regression.

Using these parameters, we transformed the values of set
2-DE #2 onto 2-DE #1. Then we combined both sets into
the reference protein set $w_{Prot}$ by averaging them, if both
values existed; otherwise, we used the existing value, viz:

$$Q_{i,2} = C_1 \left(\frac{P_{i,2}}{C_2}\right)^{D_1/D_2}$$

$w_{Prot,i} = (P_{i,1} + Q_{i,2})/2$ if both $P_{i,1}$ and $Q_{i,2}$ exist.
Else if only $P_{i,1}$ exists, $w_{Prot,i} = P_{i,1}$
Else if $Q_{i,2}$ exists, $w_{Prot,i} = Q_{i,2}$.
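
As a hedged illustration of this step, the Python sketch below maps the second gel data set onto the scale of the first and then merges the two. It assumes that each data set was fitted as a power law of the reference expression value (protein abundance equal to C times expression raised to the power D), so that the transformation raises the rescaled value to the power D1/D2. The function names, parameter values and toy data are hypothetical.

```python
def transform_to_set1(p2, c1, d1, c2, d2):
    """Map abundances of set 2 onto the scale of set 1.

    Assumes P = C * w**D for each set, hence Q = C1 * (P2 / C2) ** (D1 / D2).
    """
    return {gene: c1 * (p / c2) ** (d1 / d2) for gene, p in p2.items()}


def merge_protein_sets(p1, q2):
    """Average where both sets have a value, otherwise keep the existing one."""
    merged = {}
    for gene in set(p1) | set(q2):
        if gene in p1 and gene in q2:
            merged[gene] = (p1[gene] + q2[gene]) / 2
        elif gene in p1:
            merged[gene] = p1[gene]
        else:
            merged[gene] = q2[gene]
    return merged


# Hypothetical regression parameters and abundance values.
set1 = {"x": 6.0, "y": 3.0}
set2 = {"x": 16.0, "z": 36.0}
q2 = transform_to_set1(set2, c1=2.0, d1=1.0, c2=4.0, d2=2.0)
w_prot = merge_protein_sets(set1, q2)
```

Unlike the merging of the expression sets, no agreement cutoff is applied here; the two values are simply averaged whenever both exist, as stated in the text.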

Enrichment of features 

Formalism. In the next part of our analysis, we want 
to group a number of proteins together into various 
categories based on common features and characterize 
those features that are enriched in one population relative 
to another, i.e. the translatome population of proteins 
as measured by 2D gels relative to the transcriptome 
population of transcripts or the genome population of 
genes. To this end, we set up a formalism that could 
be applied universally to all the attributes that we were 
interested in. Due to the limitations of the experiments, 
the translatome, transcriptome, and genome populations 
are defined on different sets of genes, and sometimes we 
want to remove this 'selection bias' by forcing them to be 
compared on exactly the same set of genes. This is a key 
aspect of our formalism as presented in Figure 1.

We call an entity like $[w, G]$ a 'population,' where $G$
is a set describing a particular selection of genes from the
genome and $w$ is a vector of weights associated with each
element of this population. In particular, we focus on three
main populations here:

(i) $[1, G_{Gen}]$ is the population of genes in the genome,
all 6280 genes weighted once ($w = 1$);

(ii) $[w_{mRNA}, G_{mRNA}]$ is the observed population of the
transcripts in the transcriptome, i.e. the 6249 genes
in the reference expression set weighted by their
reference expression value;

(iii) $[w_{Prot}, G_{Prot}]$ is the observed cellular population of
the proteins in the translatome, i.e. the 181 genes
in the reference abundance set weighted by their
reference abundance value.

(The set of genes in the genome $G_{Gen}$ is approximately
equal to the genes in set $G_{mRNA}$, such that we can use
both symbols interchangeably.) We can also use this notation
to describe specific experiments, e.g. $[w_{lacZ}, G_{lacZ}]$
describes the gene set and weights relating to the transposon
abundance set.

Furthermore, we define $F_j$ as the value of a feature $F$
in ORF $j$. For example, $F$ could be the composition
of leucine (a real number) or a binary value (0 or 1)
indicating whether an ORF contains a trans-membrane
segment. Given these definitions, the weighted average of
feature $F$ in population $[w, G]$ is:

$$\mu(F, [w, G]) = \frac{\sum_{j \in G} w_j F_j}{\sum_{j \in G} w_j}$$

The weighted averages of two populations $[w, G]$ and
$[v, S]$ can be compared by simply looking at their relative
difference $\Delta$:

$$\Delta(F, [v, S], [w, G]) = \frac{\mu(F, [v, S]) - \mu(F, [w, G])}{\mu(F, [w, G])}$$

where $v$ and $w$ are weights for the sets of ORFs $S$ and $G$
respectively. We call $\Delta$ the 'enrichment' of feature $F$
because it indicates whether $F$ is enriched (if $\Delta$ is
positive) or depleted (if $\Delta$ is negative) in population $[v, S]$
relative to $[w, G]$.
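
The two definitions translate directly into code. The following Python sketch is an illustration with hypothetical gene names and values: the feature is a binary trans-membrane indicator, the genome population weights every gene once, and the transcriptome population weights each gene by an expression level.

```python
def weighted_mean(feature, weights, genes):
    """Weighted average of a feature over a gene set (the quantity mu)."""
    total = sum(weights[g] for g in genes)
    return sum(weights[g] * feature[g] for g in genes) / total


def enrichment(feature, v, s_genes, w, g_genes):
    """Relative difference (Delta) of population [v, S] versus [w, G]."""
    mu_v = weighted_mean(feature, v, s_genes)
    mu_w = weighted_mean(feature, w, g_genes)
    return (mu_v - mu_w) / mu_w


genes = ["a", "b", "c", "d"]
is_membrane = {"a": 1, "b": 0, "c": 0, "d": 1}        # binary feature F
genome_w = {g: 1.0 for g in genes}                    # each ORF counts once
mrna_w = {"a": 1.0, "b": 5.0, "c": 3.0, "d": 1.0}     # hypothetical expression

delta = enrichment(is_membrane, mrna_w, genes, genome_w, genes)
```

In this toy case half of the genes are membrane proteins but they account for only a fifth of the transcripts, so the feature is depleted in the transcriptome relative to the genome (a negative enrichment of about -0.6).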

Usually, the gene set $G$ is defined by the particular
experiment for which the weight $w$ was measured.
However, it is also possible to combine the gene set
associated with one experiment with expression levels
from another set. One may want to do this to compute
the enrichment only on the genes common to both
populations, for which there are defined values for both $w$
and $v$, viz: $\Delta(F, [v, S \cap G], [w, S \cap G])$. In practice,
this is most relevant for comparing $G_{Prot}$ and $G_{mRNA}$.
Since $G_{Prot}$ is completely a subset of $G_{mRNA}$, we need
not explicitly deal with intersections if we calculate all
statistics directly over $G_{Prot}$.

One can adjust the weight vectors to take into account
different types of averaging. For instance, when computing
the amino acid composition ($F = aa$) from the
amino acid compositions of individual ORFs $F_j = aa_j$
($\forall j \in G$), we weight by ORF length. In the case of
expression weights, we have:

$$w_j = N_j \, w_{mRNA,j} \quad \forall j \in G,$$

where $N_j$ is a measure of the length of ORF $j$ (such as the
number of amino acids).

On the other hand, when computing the average molecular
weight per amino acid, we need to normalize by the
number of amino acids per ORF, which is equivalent to
choosing the following weights:

$$w_j = \frac{w_{mRNA,j}}{N_j} \quad \forall j \in G.$$
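
To see why the length weighting is needed for composition, consider a small worked example in Python. The ORF names, lengths, alanine fractions and expression levels below are hypothetical; the point is only that weighting each ORF by length times expression reproduces the composition obtained by pooling all residues of all transcripts.

```python
# Hypothetical ORFs: length in residues, alanine fraction, expression level.
lengths = {"orf1": 100, "orf2": 300}
ala_fraction = {"orf1": 0.10, "orf2": 0.05}
expression = {"orf1": 1.0, "orf2": 2.0}

# Length-adjusted expression weights: N_j * w_mRNA,j.
weights = {g: lengths[g] * expression[g] for g in lengths}

# Weighted average of the per-ORF alanine fractions.
composition = (sum(weights[g] * ala_fraction[g] for g in lengths)
               / sum(weights.values()))

# Direct pooling: alanine residues over all residues, counting each transcript.
ala_residues = sum(lengths[g] * ala_fraction[g] * expression[g] for g in lengths)
all_residues = sum(lengths[g] * expression[g] for g in lengths)
pooled = ala_residues / all_residues
```

Both routes give 40 alanines out of 700 residues (about 5.7%), whereas weighting by expression alone would overstate the contribution of the short ORF.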



Application of methodology to quantitative 
abundance sets 

Having defined our formalism, we applied it to a diverse 
set of protein features in yeast. 

Amino acid enrichment. As shown in Figure 2a, we used 
our methodology to measure the enrichment of individual 
amino acids in both the translatome and the transcriptome 
relative to the genome. We found that three amino acids— 
valine, glycine and alanine—were consistently enriched in 
both transcriptome and translatome populations. 

In Figure 2a we compare different gene sets. In Figure 2b
we focus mainly on the variation in enrichments
when all the comparisons are restricted to the set of 181
genes ($G_{Prot} \cap G_{mRNA} = G_{Prot}$) common to all data sets.
Thus, the differences between the populations now only 
reflect the effects of differential transcription of certain 
genes and differential translation of certain transcripts. 
We find here an enrichment specifically of cysteine in the 
translatome in relation to the transcriptome. 

To measure the statistical significance of the results on 
amino acid enrichment, we have performed a control anal- 
ysis on a randomized data set (Figure 2d). We randomly 
permuted the expression values of the ORFs 1000 times
and then recomputed the enrichments. This allowed us to
compute distributions for the amino acid enrichments and,
from integrating these, one-sided p-values indicating the
significance of the observed enrichments.
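
The randomization control can be sketched as a simple permutation test in Python. This is a minimal illustration, not the analysis code of the paper: the gene names and values are hypothetical, the enrichment is computed relative to the unweighted genome average, and the p-value is estimated as a counting fraction with an added pseudo-count, which is a common convention assumed here rather than taken from the text.

```python
import random


def enrichment_vs_genome(feature, weights, genes):
    """Enrichment of the weighted population relative to the unweighted one."""
    baseline = sum(feature[g] for g in genes) / len(genes)
    mu = (sum(weights[g] * feature[g] for g in genes)
          / sum(weights[g] for g in genes))
    return (mu - baseline) / baseline


def permutation_p_value(feature, weights, n_perm=1000, seed=0):
    """One-sided p-value for a positive enrichment, by shuffling weights."""
    genes = sorted(feature)
    observed = enrichment_vs_genome(feature, weights, genes)
    rng = random.Random(seed)
    values = [weights[g] for g in genes]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(values)                      # reassign expression levels
        shuffled = dict(zip(genes, values))
        if enrichment_vs_genome(feature, shuffled, genes) >= observed:
            hits += 1
    # Assumed convention: pseudo-count so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data in which highly expressed genes have high feature values.
toy_feature = {f"g{k}": float(k) for k in range(1, 9)}
toy_weights = {f"g{k}": float(k) for k in range(1, 9)}
obs, p_value = permutation_p_value(toy_feature, toy_weights)
```

For a depleted feature the comparison would be reversed, counting permutations whose enrichment is less than or equal to the observed value.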

Amino acid enrichment in Transposon data set. We 
also tried to extend our methodology, ineffectively, to
cope with the semi-quantitative Transposon set. We used 
only those 450 ORFs that consistently yielded either no 
expression or high expression, as binary data, on or off. We 
show the enrichments of amino acids computed from this 
filtered Transposon abundance set in Figure 2a. Overall, 
the enrichments from this set seemed to be attenuated in 
comparison to other data. 

Biomass enrichment. A corollary to amino acid enrichments
is the determination of the average biomass of the
transcriptome and translatome populations (shown in Figure 2c).
We found that the average molecular weight of
a protein in both populations was, on average, lower than
in the genome population. These preliminary observations
suggest a cell preference to use less energetically expensive
proteins for those that are highly transcribed or trans-

Fig. 2. Amino acid and biomass enrichment. (a) Shows the amino acid enrichments between different populations as indicated by the legend
to the right of the plot (the legend is ordered in the same way as the schematic illustration in Figure 1). The bars indicate the enrichment of the
transcriptome relative to the genome, whereas the circles indicate the enrichment of the translatome relative to the genome. In addition, we
also show the enrichment for protein abundance from the Transposon abundance set, represented by the circles with the line through them.
(b) Shows a different view of amino acid enrichment from that contained in (a), now focusing on changes, and thus restricting the comparison
to the genes common to all the data sets. The graph is ordered according to the enrichment from transcriptome to translatome (black squares).
We focus here only on the changes for the abundance gene set ($G_{Prot}$) to exclude the effects that arise from looking at different subsets. In this
view the enrichments from genome to transcriptome (white squares) and from genome to translatome (white diamonds) look more similar
than do the analogous sets in (a). To make comparison with (a) easier we again show the enrichment from genome to the transcriptome for
the complete gene set ($G_{Gen}$, shown in bars). (c) Shows biomass enrichment. The left panel depicts the average molecular weight per ORF
(in units of kDa) and the right panel, the average molecular weight per amino acid (in units of Daltons) in each of the three stages of gene
expression. The numbers inside the circles indicate the average molecular weights. The values next to the arrows indicate the enrichments
in biomass between different populations. Both the circle diameters and the arrow widths are functions of the corresponding values (the
hollow arrow indicates a positive value). It is very clear that the average molecular weight per ORF is much lower in the translatome (by 20
or 15%) and transcriptome (by 29%) than in the genome. This relative depletion of biomass mainly takes place as a result of transcription; the
effect of translation is less clear, depending on the populations compared. On the other hand, the depletion in the average molecular weight
per amino acid (-3.3% from genome to translatome) is an order of magnitude smaller than in the average weight per ORF. This shows
that the yeast cell favors the expression of shorter ORFs over longer ones, and agrees with our earlier observation that there is a negative
correlation between maximum ORF length and mRNA expression (Jansen and Gerstein, 2000); it seems that this effect mainly takes place
during transcription rather than translation. (d) This plot shows that the amino acid enrichments are statistically significant. We have assessed
significance by randomly permuting the expression levels among the genes and then recomputing the amino acid enrichments. This procedure
can be repeated and used to generate distributions of random enrichments that can then be compared against the observed enrichments. In
the plot the gray bars represent the observed enrichments already shown in Figure 3a. On top of the gray bars we show standard boxplots
of enrichment distributions based on 1000 random permutations. (The middle line represents the distribution median. The upper and lower
sides of the box coincide with the upper and lower quartiles. Outliers are shown as dots and defined as data points that are outside the range
of the whiskers, the length of which is 1.5 times the interquartile distance.) Based on the random distributions, we can compute one-sided p-values
for the observed cnuchmcnis. Amino acids for which the /?- values arc less than 10" 3 are shown in bold font. 



bled. However, we also found that ihe average molecu- 
lar weigh! pet amino acid differed much less between the 
transcriptome and the uanslaiome on the one hand, and the 



genome on the other hand (though it was still slightly iess). 
This finding indicates thai lower molecular weights in the 
translatome and transcriptome relative to ihe genome are 
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predominantly due to greater expression of shorter pro- 
teins rather than the incorporation of smaller amino acids. 
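
The permutation procedure described in the caption of Figure 2d (shuffle the expression levels among the genes, recompute the enrichment, repeat) can be illustrated with a minimal Python sketch. It assumes that the enrichment of a feature is the relative difference between its expression-weighted average and its unweighted genome average. The function names, the default of 1000 permutations, and the way the one-sided p-value is counted are illustrative assumptions, not the authors' code.

```python
import random
from statistics import mean


def weighted_mean(values, weights):
    """Average of `values` weighted by `weights` (e.g. expression levels)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def enrichment(feature, expression):
    """Relative change of the expression-weighted average of a feature
    compared with its unweighted (genome) average. Assumed definition."""
    genome_avg = mean(feature)
    return (weighted_mean(feature, expression) - genome_avg) / genome_avg


def permutation_p_value(feature, expression, n_perm=1000, seed=0):
    """One-sided p-value for the observed enrichment, obtained by
    permuting expression levels among the genes."""
    rng = random.Random(seed)
    observed = enrichment(feature, expression)
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        e = enrichment(feature, shuffled)
        # Count permutations at least as extreme in the observed direction.
        if (observed >= 0 and e >= observed) or (observed < 0 and e <= observed):
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

Here `feature` would hold one value per gene (for example the fraction of a given amino acid in the encoded protein) and `expression` the matching mRNA or protein level. The same weighted average applied to molecular weights gives the biomass comparison of Figure 2c.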

Secondary structure composition. We also used our
methodology to study the enrichment of secondary-structural
features. Secondary structural annotation was
derived from structure prediction applied uniformly to all
the ORFs in the yeast genome, as described in Table 1.
As shown in Figure 3a, all three populations (genome,
transcriptome, and translatome) had a fairly similar
composition of secondary structures (sheets, helices, and
coils). The differences between populations were marginal
and based only on the small subset of genes.

We also found that transmembrane (TM) proteins
were significantly depleted in the transcriptome (see
website and caption). These results are consistent with
our previous analyses (Jansen and Gerstein, 2000). The
protein abundance data do not include any membrane
proteins.

Subcellular localization. Figure 3c shows the enrichment
of proteins associated with the various subcellular
compartments. For clarity, we divided the cell into five
distinct subcellular compartments (see Table 1). We
found that, in comparison to the genome, both the
transcriptome and translatome are enriched in cytoplasmic
proteins. This is true whether we make our comparisons in
relation to the relatively large reference mRNA expression
set or the smaller reference protein abundance set. As
Figure 3c shows, the 2D gel experiments are clearly
biased towards proteins from the cytoplasm. However, in
the biased subset G_Prot, transcription and translation lead
to an even higher fraction of cytoplasmic proteins in the
translatome.

Functional categories. Finally, we compared the enrichment
of various functional categories in both the translatome
and the transcriptome (see Figure 3b). This gives
us a broad yet informative view of the cell as a whole. As
described in Table 1, we used the top level of the MIPS
scheme for the functional category definitions. We found
broad differences between the various populations, with
some of the functional categories showing strikingly high
enrichments.

DISCUSSION AND CONCLUSION 

We developed: (i) a methodology for integrating many different
types of gene expression and protein abundance data into
a common framework, which we applied to a preliminary
analysis; (ii) a procedure for scaling and merging different
mRNA and protein sets together; and (iii) an approach for
computing the enrichment of various proteomic features in
the population of transcripts and proteins. We showed that
by analyzing broad categories instead of individual noisy

Fig. 3. Breakdown of the transcriptome and translatome in terms of broad categories relating to structure, localization, and function. All
of the subfigures are analogous to the schematic illustration in Figure 1. (a) Represents the composition of secondary structure in the
different populations. (b) Represents the distribution of subcellular localizations associated with proteins in the various populations. We used
standardized localizations developed earlier (Drawid and Gerstein, 2000), which, in turn, were derived from the MIPS, YPD, and SwissProt
databases (Bairoch and Apweiler, 2000; Costanzo et al., 2000; Mewes et al., 2000). The subcellular localization has been experimentally
determined for less than half of the yeast proteins, so our analysis applies only to this subset. (c) Shows the division of ORFs into different
functional categories (according to the MIPS classification) in the various populations. Only the largest functional categories of the top level
of the MIPS classification are shown. The group 'other' contains the smaller top-level categories lumped together. This 'other' group is
different from the group 'unclassified,' which contains genes without any functional description.



data points, we could find logical trends in the underlying
data. For example, individual transcription factors might
have higher or lower protein abundance than one expects
from their mRNA expression, but the category 'transcription
factors' as a whole has a similar representation in the
transcriptome and translatome.

We found, as previously described (Futcher et al., 1999;
Gygi et al., 1999b; Greenbaum et al., 2001), a weak
correlation between individual measurements of mRNA
and protein abundance. The outliers of this correlation
tend to be associated with cellular organization. One
might conceive of using these outliers (i.e. those with
significantly different transcriptional and translational
behavior) to find consensus regulatory sequences. One
possible method would involve using predicted mRNA
structures (Jaeger et al., 1990; Zuker, 2000) to find and
investigate consensus structural elements in these outliers
to which the yeast translational machinery is known to be

Table 1. Data sets

Data set                  Description                                   Size [ORFs]   Reference

mRNA expression
  Young                   Gene chip profiles of yeast cells with        5455          Holstege et al. (1998)
                          mutations that affect transcription
  Church                  Gene chip profiles of yeast cells under       6263          Roth et al. (1998)
                          four different conditions
  Samson                  Comparing gene chip profiles for yeast        6090          Jelinsky and Samson (1999)
                          cells subjected to alkylating agent
  SAGE                    Yeast cells during vegetative growth          3778          Velculescu et al. (1997)
  Reference expression    Scaling and integrating the mRNA              6249
                          expression sets into one data source

Protein abundance
  2-DE #1                 Measurement of yeast protein abundance by     156           Gygi et al. (1999a,b)
                          2D gel electrophoresis and mass spectrometry
  2-DE #2                 Similar to 2-DE set #1                        71            Futcher et al. (1999)
  Transposon              Large-scale fusions of yeast genes with       1410          Ross-Macdonald et al. (1999)
                          lacZ by transposon insertion
  Reference abundance     Scaling and integrating the 2-DE data sets    181
                          into one data source

Annotation
  Annotated localization  Subcellular localizations of yeast proteins   2133 (6280)   Drawid and Gerstein (2000)
  TM segments             Predicted TM and soluble proteins in yeast    2710 (6280)   Gerstein (1998a,b,c)
  MIPS functions          Functional categories for yeast ORFs          3519 (6194)   Mewes et al. (2000)
  GOR secondary structure Predicted secondary structure of yeast ORFs   6280          Gerstein (1998a,b,c)

This table provides an overview of the data sets used in our analysis. The table is divided into three sections. The top section lists different mRNA expression
sets. The middle section shows the protein abundance data sets used. The bottom section contains different annotations of protein features. The column 'Data
set' lists a shorthand reference to each data set used throughout this paper. The next columns contain a brief description of the data sets, the number of ORFs
contained in each of them, and the literature reference. In contrast to the other data we investigated, the reference expression and abundance data sets have
been calculated for the purpose of our analysis (see text). An expanded version of the table is available on our web site.
Some further information on the genome annotations:

Localization. Protein localization information from YPD, MIPS and SwissProt was merged, filtered and standardized (Bairoch and Apweiler, 2000;
Costanzo et al., 2000; Mewes et al., 2000) into five simplified compartments: cytoplasm, nucleus, membrane, extracellular (including proteins in ER and
golgi), and mitochondrial, according to the protocol in Drawid et al. (2000). This yielded a standardized annotation of protein subcellular localization for
2133 out of 6280 ORFs.

TM segments. In 2710 out of 6280 yeast ORFs TM segments are predicted to occur, ranging from low to high confidence (732 ORFs). The TM prediction was
performed as follows: the values from the scale for amino acids in a window of size 20 (the typical size of a TM helix) were averaged and then compared
against a cutoff of -1 kcal mol^-1. A value under this cutoff was taken to indicate the existence of a TM helix. Initial hydrophobic stretches corresponding to
signal sequences for membrane insertion were excluded. (These have the pattern of a charged residue within the first seven, followed by a stretch of 14 with
an average hydrophobicity under the cutoff.) These parameters have been used, tested, and refined on surveys of membrane protein in genomes. 'Sure'
membrane proteins had at least two TM segments with an average hydrophobicity less than -2 kcal mol^-1 (Rost et al., 1995; Gerstein et al., 2000; Santoni
et al., 2000; Senes et al., 2000).
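
The sliding-window rule in the TM segments note (average a per-residue value over 20 residues and compare it with a cutoff of -1 kcal mol^-1) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the hydrophobicity scale is passed in by the caller, the function names are hypothetical, overlapping windows are merged into one reported segment, and the exclusion of N-terminal signal sequences is not implemented.

```python
def window_averages(seq, scale, window=20):
    """Average per-residue scale value over every window of `window` residues."""
    values = [scale[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def predict_tm_helices(seq, scale, window=20, cutoff=-1.0):
    """Return (start, end) pairs (0-based, end exclusive) covering windows
    whose average value lies under `cutoff`. Overlapping windows are merged,
    which is an assumption about how segments are reported."""
    segments = []
    for i, avg in enumerate(window_averages(seq, scale, window)):
        if avg < cutoff:
            if segments and i <= segments[-1][1]:
                # Window overlaps or touches the previous segment: extend it.
                segments[-1] = (segments[-1][0], i + window)
            else:
                segments.append((i, i + window))
    return segments


# Toy two-letter scale for demonstration only; not the scale used in the paper.
toy_scale = {"L": -2.0, "K": 3.0}
print(predict_tm_helices("K" * 10 + "L" * 20 + "K" * 10, toy_scale))
```

A stricter call with `cutoff=-2.0`, followed by a check for at least two reported segments, would approximate the criterion given above for 'sure' membrane proteins.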

Functions. MIPS functional categories have been assigned to 3519 out of 6194 ORFs. (The remainder are assigned to category '98' or '99,' which
corresponds to unclassified function.)



sensitive (McCarthy, 1998).
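
As a rough illustration of how the outliers of the mRNA-protein correlation might be picked out, the following Python sketch fits a straight line to log-transformed abundances and flags genes whose residual exceeds a chosen number of standard deviations. The log-log least-squares fit, the threshold of two standard deviations, and all names are assumptions made for this example; the text does not specify how outliers were defined.

```python
import math


def log_log_fit(mrna, protein):
    """Least-squares line through (log10 mRNA, log10 protein) points.
    Returns slope, intercept and the residual of every gene."""
    xs = [math.log10(v) for v in mrna]
    ys = [math.log10(v) for v in protein]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return slope, intercept, residuals


def correlation_outliers(names, mrna, protein, n_sd=2.0):
    """Names of genes whose residual is more than `n_sd` residual
    standard deviations away from the fitted line."""
    _, _, residuals = log_log_fit(mrna, protein)
    sd = math.sqrt(sum(r * r for r in residuals) / (len(residuals) - 2))
    return [name for name, r in zip(names, residuals) if abs(r) > n_sd * sd]
```

The flagged genes would then be the candidates whose mRNA sequences and predicted structures are searched for shared elements.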

In relation to functional categories, we found three
trends that were particularly notable: (i) the 'cellular
organization,' 'protein synthesis,' and 'energy production'
categories were increasingly enriched as we moved from
genome to transcriptome to translatome; (ii) in the
transcriptome and translatome populations, relative to the
genome, proteins with 'unclassified function' are significantly
depleted, perhaps reflecting a bias against studying them;
(iii) proteins in the 'transcription' and 'cell growth, cell
division, and DNA synthesis' categories were consistently
depleted. This reflects the fact that many of these proteins,
such as transcription factors, act as 'switches' such that
only small quantities of the protein are necessary to
activate or deactivate a process. These results concur with
previous calculations (Jansen and Gerstein, 2000) wherein
we found the transcriptome is enriched specifically with
proteins involved in protein synthesis and energy.

Limitations given the small size of the protein 
abundance data 

Even with the extended coverage made possible by
merging many data sets together into reference sets, the
analysis is still limited by the small amount of data. This
applies most to the protein abundance measurements,
potentially biasing our statistical results towards certain
protein families. Moreover, the 181 proteins in G_Prot do
not represent a random sample. They are skewed towards
highly expressed, well-studied proteins. Our methodology
attempts to control for this gene-selection bias through
our enrichment formalism, which allows one to rather
precisely gauge various aspects of the bias. Conversely,
many protein features in both the translatome and the
transcriptome are dominated by highly expressed proteins.
Under these circumstances, it is often sufficient to look at
this smaller number of dominating proteins to characterize
the whole population. This is similar to the development of
the codon adaptation index for yeast (Sharp and Li, 1987).
While based on only 24 highly expressed proteins, it has
proven to be robust in predicting expression levels for the
entire genome.

We believe that the essential formalism and approach
that we develop will remain quite relevant for future data
sets (Smith, 2000).

ACKNOWLEDGEMENT 

M.G. thanks the Keck foundation for support. 

REFERENCES 

An,H., Scopes,R.K. et al. (1991) Gel electrophoretic analysis
of Zymomonas mobilis glycolytic and fermentative enzymes:
identification of alcohol dehydrogenase II as a stress protein.
J. Bacteriol., 173, 5975-5982.

Anderson,L. and Seilhamer,J. (1997) A comparison of selected
mRNA and protein abundances in human liver. Electrophoresis,
18, 533-537.

Bairoch,A. (2000) Serendipity in bioinformatics, the tribulations of
a Swiss bioinformatician through exciting times! Bioinformatics,
16, 48-64.

Bairoch,A. and Apweiler,R. (2000) The SWISS-PROT protein
sequence database and its supplement TrEMBL in 2000. Nucleic
Acids Res., 28, 45-48.

Bassett,D.E. Jr., Basrai,M.A. et al. (1996) Exploiting the complete
yeast genome sequence. Curr. Opin. Genet. Dev., 6, 763-766.

Batke,J., Benito,V.A. et al. (1992) A possible in vivo mechanism
of intermediate transfer by glycolytic enzyme complexes: steady
state fluorescence anisotropy analysis of an enzyme complex
formation. Arch. Biochem. Biophys., 296, 654-659.

Cambillau,C. and Claverie,J.M. (2000) Structural and genomic
correlates of hyperthermostability. J. Biol. Chem., 275,
32383-32386.

Cavalcoli,J.D., VanBogelen,R.A. et al. (1997) Unique identification
of proteins from small genome organisms: theoretical feasibility
of high throughput proteome analysis. Electrophoresis, 18,
2703-2708.

Claverie,J.M. (1999) Computational methods for the identification
of differential and coordinated gene expression [in process
citation]. Hum. Mol. Genet., 8, 1821-1832.

Corthals,G., Wasinger,V.C., Hochstrasser,D.F. and Sanchez,J.C.
(2000) The dynamic range of protein expression: a challenge for
proteomic research. Electrophoresis, 21, 1104-1

Costanzo,M.C., Hogan,J.D. et al. (2000) The Yeast Proteome
Database (YPD) and Caenorhabditis elegans Proteome Database
(WormPD): comprehensive resources for the organization and
comparison of model organism protein information. Nucleic
Acids Res., 28, 73-76.

Das,R. and Gerstein,M. (2000) The stability of thermophilic
proteins: a study based on comprehensive genome comparison.
Funct. Int. Genom., 1, 33-45.

Doolittle,W.F. (2000) The nature of the universal ancestor and the
evolution of the proteome. Curr. Opin. Struct. Biol., 10, 355-358.

Drawid,A. and Gerstein,M. (2000) A Bayesian system integrating
expression data with sequence patterns for localizing proteins:
comprehensive application to the yeast genome. J. Mol. Biol.,
301, 1059-1075.

Drawid,A., Jansen,R. et al. (2000) Gene expression levels are
correlated with protein subcellular localization. Trends Genet.,
16, 426-430.

Einarson,M. and Golemis,E. (2000) Encroaching genomics: adapting
large-scale science to small academic laboratories. Physiol.
Genom., 2, 85-92.

Eisen,M.B. and Brown,P.O. (1999) DNA arrays for analysis of gene
expression. Meth. Enzymol., 303, 179-205.

Epstein,C. and Butow,R. (2000) Microarray technology - enhanced
versatility, persistent challenge. Curr. Opin. Biotechnol., 11,
36-41.

Ferea,T. and Brown,P. (1999) Observing the living genome. Curr.
Opin. Genet. Dev., 9, 715-722.

Fey,S.J., Nawrocki,A. et al. (1997) Proteome analysis of
Saccharomyces cerevisiae: a methodological outline.
Electrophoresis, 18, 1361-1372.

Fey,S.J. and Larsen,P.M. (2001) 2D or not 2D. Two dimensional gel
electrophoresis. Curr. Opin. Chem. Biol., 5, 26-33.

Frishman,D. and Mewes,H.W. (1997) Protein structural classes in
five complete genomes [letter]. Nat. Struct. Biol., 4, 626-628.

Frishman,D. and Mewes,H.W. (1999) Genome-based structural
biology. Prog. Biophys. Mol. Biol., 72, 1-17.






Futcher,B., Latter,G. et al. (1999) A sampling of the yeast proteome.
Mol. Cell Biol., 19, 7357-7368.

Gaasterland,T. (1999) Archaeal genomics. Curr. Opin. Microbiol.,
2, 542-547.

Garrels,J.I., McLaughlin,C.S. et al. (1997) Proteome studies of
Saccharomyces cerevisiae: identification and characterization of
abundant proteins. Electrophoresis, 18, 1347-1360.

Gerstein,M. (1997) A structural census of genomes: comparing
bacterial, eukaryotic, and archaeal genomes in terms of protein
structure. J. Mol. Biol., 274, 562-576.

Gerstein,M. (1998a) How representative are the known structures of
the proteins in a complete genome? A comprehensive structural
census. Fold. Des., 3, 497-512.

Gerstein,M. (1998b) Patterns of protein-fold usage in eight
microbial genomes: a comprehensive structural census. Proteins, 33,
518-534.

Gerstein,M. (1998c) Patterns of protein-fold usage in eight
microbial genomes: a comprehensive structural census. Proteins, 33,
518-534.

Gerstein,M. and Hegyi,H. (1998) Comparing genomes in terms of
protein structure: surveys of a finite parts list. FEMS Microbiol.
Rev., 22, 277-304.

Gerstein,M. and Jansen,R. (2000) The current excitement in
bioinformatics, analysis of whole-genome expression data: how does
it relate to protein structure and function. Curr. Opin. Struct.
Biol., 10, 574-584.

Gerstein,M., Lin,J. et al. (2000) Protein folds in the worm genome.
Pac. Symp. Biocomput., 30-41.

Greenbaum,D., Luscombe,N. et al. (2001) Interrelating different
types of genomic data, from proteome to secretome: 'oming in
on function. Genome Res., 11, 1463-1468.

Gygi,S.P., Rist,B. et al. (1999a) Quantitative analysis of complex
protein mixtures using isotope-coded affinity tags. Nature
Biotechnol., 994-999.

Gygi,S.P., Rochon,Y. et al. (1999b) Correlation between protein and
mRNA abundance in yeast. Mol. Cell Biol., 19, 1720-1730.

Gygi,S.P., Corthals,G.L. et al. (2000a) Evaluation of
two-dimensional gel electrophoresis-based proteome analysis
technology. Proc. Natl Acad. Sci. USA, 97, 9390-9395.

Gygi,S.P., Rist,B. et al. (2000b) Measuring gene expression by
quantitative proteome analysis. Curr. Opin. Biotechnol., 11,
396-401.

Harry,J.L., Wilkins,M.R. et al. (2000) Proteomics: capacity versus
utility. Electrophoresis, 21, 1071-1081.

Hatzimanikatis,V., Choe,L.H. et al. (1999) Proteomics: theoretical
and experimental considerations. Biotechnol. Prog., 15, 312-318.

Haynes,P.A. and Yates,J.R. (2000) Proteome profiling-pitfalls and
progress. Yeast, 17, 81-87.

Hegyi,H. and Gerstein,M. (1999) The relationship between protein
structure and function: a comprehensive survey with application
to the yeast genome. J. Mol. Biol., 288, 147-164.

Holstege,F.C., Jennings,E.G. et al. (1998) Dissecting the regulatory
circuitry of a eukaryotic genome. Cell, 95, 717-728.

Ishii,M., Hashimoto,S. et al. (2000) Direct comparison of GeneChip
and SAGE on the quantitative accuracy in transcript profiling
analysis. Genomics, 68, 136-143.
Ito,T., Tashiro,K. et al. (2000) Toward a protein-protein interaction
map of the budding yeast: a comprehensive system to examine
two-hybrid interactions in all possible combinations between the
yeast proteins. Proc. Natl Acad. Sci. USA, 97, 1143-1147.

Jaeger,J.A., Turner,D.H. et al. (1990) Predicting optimal and
suboptimal secondary structure for RNA. Meth. Enzymol., 183,
281-306.

Jansen,R. and Gerstein,M. (2000) Analysis of the yeast transcriptome
with structural and functional categories: characterizing
highly expressed proteins. Nucleic Acids Res., 28, 1481-1488.

Jelinsky,S.A. and Samson,L.D. (1999) Global response of
Saccharomyces cerevisiae to an alkylating agent. Proc. Natl Acad.
Sci. USA, 96, 1486-1491.

Jones,D.T. (1998) Do transmembrane protein superfolds exist?
FEBS Lett., 423, 281-285.

Jones,D.T. (1999) GenTHREADER: an efficient and reliable protein
fold recognition method for genomic sequences. J. Mol. Biol.,
287, 797-815.

Kidd,D. et al. (2001) Profiling serine hydrolase activities in complex
proteomes. Biochemistry, 40, 4005-4015.

Klose,J. (1975) Protein mapping by combined isoelectric focusing
and electrophoresis of mouse tissues. A novel approach to
testing for induced point mutations in mammals. Humangenetik,
26, 231-243.

Krogh,A. et al. (2001) Predicting transmembrane protein topology
with a hidden Markov model: application to complete genomes.
J. Mol. Biol., 305, 567-580.

Lin,J. and Gerstein,M. (2000) Whole-genome trees based on the
occurrence of folds and orthologs: implications for comparing
genomes on different levels. Genome Res., 10, 808-818.

Lipshutz,R.F.S., Gingeras,T.R. and Lockhart,D.J. (1999) High
density synthetic oligonucleotide arrays. Nature Genet., 21,
20-24.

Lopez,M.F. (2000) Better approaches to finding the needle in a
haystack: optimizing proteome analysis through automation.
Electrophoresis, 21, 1082-1093.

MacBeath,G. and Schreiber,S.L. (2000) Printing proteins as
microarrays for high-throughput function determination. Science,
289, 1760-1763.

Matton,D.P., Constabel,P. et al. (1990) Alcohol dehydrogenase gene
expression in potato following elicitor and stress treatment. Plant
Mol. Biol., 14, 775-783.

McCarthy,J.E. (1998) Posttranscriptional control of gene expression
in yeast. Microbiol. Mol. Biol. Rev., 62, 1492-1553.

Mewes,H.W., Frishman,D. et al. (2000) MIPS: a database for
genomes and protein sequences. Nucleic Acids Res., 28, 27-40.

Millar,A.A., Olive,M.R. et al. (1994) The expression and anaerobic
induction of alcohol dehydrogenase in cotton. Biochem. Genet.,
32, 279-300.

Molloy,M.P. (2000) Two dimensional electrophoresis of membrane
proteins using immobilized pH gradients. Anal. Biochem., 280,
1-10.

Nauchitel,V.V. and Somorjai,R.L. (1994) Spatial and free energy
distribution patterns of amino acid residues in water soluble
proteins. Biophys. Chem., 51, 327-336.

Nelson,R.W., Nedelkov,D. et al. (2000) Biosensor chip mass
spectrometry: a chip-based proteomics approach. Electrophoresis, 21,
1155-1163.

O'Farrell,P.H. (1975) High resolution two-dimensional
electrophoresis of proteins. J. Biol. Chem., 250, 4007-4021.

Pandey,A. and Mann,M. (2000) Proteomics to study genes and
genomes. Nature, 405, 837-846.

Qi,S.Y., Moir,A. et al. (1996) Proteome of Salmonella typhimurium
SL1344: identification of novel abundant cell envelope proteins
and assignment to a two-dimensional reference map. J. Bacteriol.,
178, 5032-5038.

Ross-Macdonald,P., Coelho,P.S. et al. (1999) Large-scale analysis
of the yeast genome by transposon tagging and gene disruption.
Nature, 402, 413-418.

Rost,B., Casadio,R. et al. (1995) Transmembrane helices predicted
at 95% accuracy. Protein Sci., 4, 521-533.

Roth,F.P., Hughes,J.D. et al. (1998) Finding DNA regulatory motifs
within unaligned noncoding sequences clustered by whole-genome
mRNA quantitation. Nature Biotechnol., 16, 939-945.

Rubin,G.M., Yandell,M.D. et al. (2000) Comparative genomics of
the eukaryotes. Science, 287, 2204-2215.

Sali,A. (1999) Functional links between proteins. Nature, 402,
25-26.

Santoni,V., Molloy,M. et al. (2000) Membrane proteins and
proteomics: un amour impossible? Electrophoresis, 21, 1054-1070.

Schena,M., Shalon,D. et al. (1995) Quantitative monitoring of gene
expression patterns with a complementary DNA microarray.
Science, 270, 467-470.

Searls,D.B. (2000) Using bioinformatics in gene and drug
discovery. Drug Discov. Today, 5, 135-143.

Senes,A., Gerstein,M. et al. (2000) Statistical analysis of amino
acid patterns in transmembrane helices: the GxxxG motif occurs
frequently and in association with beta-branched residues at
neighboring positions. J. Mol. Biol., 296, 921-936.

Shapiro,L. and Harris,T. (2000) Finding function through structural
genomics. Curr. Opin. Biotechnol., 11, 31-35.

Sharp,P.M. and Li,W.H. (1987) The codon adaptation index - a
measure of directional synonymous codon usage bias, and its
potential applications. Nucleic Acids Res., 15, 1281-1295.

Sherlock,G. (2000) Analysis of large-scale gene expression data.
Curr. Opin. Immunol., 12, 201-205.

Shevchenko,A., Jensen,O.N. et al. (1996) Linking genome and
proteome by mass spectrometry: large-scale identification of
yeast proteins from two dimensional gels. Proc. Natl Acad. Sci.
USA, 93, 14440-14445.

Smith,R.D. (2000) Probing proteomes-seeing the whole picture?
Nature Biotechnol., 18, 1041-1042.

Tatusov,R.L., Koonin,E.V. et al. (1997) A genomic perspective on
protein families. Science, 278, 631-637.

Tekaia,F., Lazcano,A. et al. (1999) The genomic tree as revealed
from whole proteome comparisons. Genome Res., 9, 550-557.

Velculescu,V.E., Zhang,L. et al. (1997) Characterization of the yeast
transcriptome. Cell, 88, 243-251.

Wallin,E. and von Heijne,G. (1998) Genome-wide analysis of
integral membrane proteins from eubacterial, archaean, and
eukaryotic organisms. Protein Sci., 7, 1029-1038.

Washburn,M.P., Wolters,D. et al. (2001) Large-scale analysis of
the yeast proteome by multidimensional protein identification
technology. Nature Biotechnol., 19, 242-247.

Washburn,M.P. and Yates,J.R. 3rd (2000) Analysis of the microbial
proteome. Curr. Opin. Microbiol., 3, 292-297.

Wittes,J. and Friedman,H.P. (1999) Searching for evidence of
altered gene expression: a comment on statistical analysis of
microarray data [editorial; comment]. J. Natl Cancer Inst., 91,
400-401.

Wolf,Y.I., Brenner,S.E. et al. (1999) Distribution of protein folds in
the three superkingdoms of life. Genome Res., 9, 17-26.

Young,K.H. (1998) Yeast two-hybrid: so many interactions, (in) so
little time... Biol. Reprod., 58, 302-311.

Zhang,M.Q. (1999) Large-scale gene expression data analysis:
a new challenge to computational biologists [published erratum
appears in Genome Res., 1999, 9, 1156]. Genome Res., 9,
681-688.

Zhu,H., Klemic,J.F. et al. (2000) Analysis of yeast protein kinases
using protein chips. Nature Genet., 26, 283-289.

Zuker,M. (2000) Calculating nucleic acid secondary structure. Curr.
Opin. Struct. Biol., 10, 303-310.






Vol. 7, 5-22, January 2001 Clinical Cancer Research



Review

Early Detection of Lung Cancer: Clinical Perspectives of Recent 
Advances in Biology and Radiology 1 



Fred R. Hirsch, 2 Wilbur A. Franklin,
Adi F. Gazdar, and Paul A. Bunn, Jr.

Lung Cancer Program and Departments of Medicine and Pathology,
University of Colorado Cancer Center, Denver, Colorado 80262
[F. R. H., W. A. F., P. A. B.]; Department of Pathology, University of
Texas, Southwestern Medical Center, Dallas, Texas [A. F. G.]; and
Department of Oncology, Finsen Center, National University
Hospital, Copenhagen, Denmark [F. R. H.]



Abstract 

Lung cancer is the most common cause of cancer death
in developed countries. The prognosis is poor, with less than
15% of patients surviving 5 years after diagnosis. The poor
prognosis is attributable to lack of efficient diagnostic methods
for early detection and lack of successful treatment for
metastatic disease. Most patients (>75%) present with stage
III or IV disease and are rarely curable with current therapies.
Within the last decade, rapid advances in molecular
biology, pathology, bronchology, and radiology have provided
a rational basis for improving outcome. These advancements
have led to a better documentation of morphological
changes in the bronchial epithelium before development
of clinically evident invasive carcinomas. This has changed our
concept of lung carcinogenesis and emphasized the multistep
carcinogenesis approach on several levels. Combined with
the technical developments in bronchoscopic techniques,
e.g., laser-induced fluorescence endoscope (LIFE) bronchoscopy,
we now have improved methods to localize preinvasive
and early-invasive bronchial lesions. With the LIFE
bronchoscope, a new morphological entity (angiogenic squamous
dysplasia) has been recognized, which might be an important
biomarker and target for antiangiogenic chemopreventive
agents. To reduce the mortality of lung cancer, these
new technologies have been taken into the clinic in different
scientific settings. The use of low-dose spiral computed
tomography in the screening of a high-risk population has
demonstrated the possibility of diagnosing small peripheral
tumors that are not seen on conventional X-ray. A shift in
the therapeutic paradigm from targeting advanced clinically
manifest lung cancer toward asymptomatic preinvasive and
early-invasive cancer is occurring. The present article reviews
the recent advances in the diagnosis of preinvasive and
early-invasive cancer to identify biomarkers for early detection
of lung cancer and for chemoprevention studies.

Introduction 

Lung cancer is the most common cause of cancer deaths in
the countries of North America and other developed countries,
accounting for 29% of all cancer deaths and more deaths than
from prostate, breast, and colorectal cancer combined in the
United States (1). Lung cancer will be diagnosed in ~170,000
new patients in the United States in the year 2000, and <15% of
them will survive 5 years after diagnosis (1). The prognosis for
patients with lung cancer is strongly correlated with the stage
of the disease at the time of diagnosis. Whereas patients with
clinical stage IA disease have a 5-year survival of about 60%,
the clinical stage II-IV disease 5-year survival rate ranges from
40% to less than 5% (2). Over two-thirds of the patients have
regional lymph-node involvement or distant disease at the time
of presentation (3). The poor prognosis is largely attributable to
the lack of effective early detection methods and the inability to
cure metastatic disease. The unsatisfactory cure rates support
efforts aimed at early identification and intervention in lung
cancer.

Historically, the only diagnostic tests available for the
detection of lung cancer in its early stages were chest radiography
and sputum cytology. The efficacy of these tests as mass
screening tools was evaluated in controlled trials sponsored by
the NCI and conducted at Johns Hopkins University, Memorial
Sloan-Kettering Cancer Center, and the Mayo Clinic during the
1970s (4-6). The principal goal of these studies was to determine
whether a reduction in lung cancer mortality could be
achieved by adding sputum cytology testing to annual screening
by chest radiography. Results from these trials showed that both
tests could detect presymptomatic, early-stage carcinoma,
particularly of squamous cell type. Resectability and survival rates
were found to be generally higher in the study groups than in the
control groups. However, improvements in resectability and
survival did not lead to a reduction in overall lung cancer
mortality, the most critical end point. A subsequent study of
6346 Czechoslovakian male smokers also found no reduction in
lung cancer mortality after dual screening by chest radiography



•'The abbrc\ rations used are: N'CI. Notional Cancer Institute. CIS, 
caicmoma in situ', CI. computed tomography; ASD. angiogenic sc|u:i- 
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lung carcinoma; WLR. white lighj bronchoscopy; LIFE, laser- induced 
fluorescence endoscope: ELCAP. Eaily Lung Cancer Action Project; 
PET. positron emission tomography. FOG. f ir l : jfluoro-2-deo\vglwi nse. 
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and sputum cytology (7). The negative results from these 
screening studies lead the NCJ and other health policy and 
research groups to conclude that mass screening programs in- 
volving periodic sputum cytological evaluation and chest radio- 
graphs could not be justified. However, controversies in the 
methodology and interpretation of the data from these studies 
have later been extensively discussed (8, 9). One additional 
study of annual chest X-ray screening is currently being con- 
ducted by the NCI; The Prostate-, Lung-, Colorectal-, and Ovar- 
ian (PLCO) screening trial. This trial includes individuals 55-74 
years old, but they are not selected for this trial on the basis of 
high risk for lung cancer (e.g., smoking history with > 20 
pack-years). 

The failure of clinical trials to demonstrate the efficacy of 
sputum cytology and chest radiography as mass screening tools 
has resulted in a search for better diagnostic approaches for 
early lung cancer detection that take advantage of recent devel- 
opments in molecular biology, gene technology, and radiology 
(10). Furthermore, as has been the case for mammography 
screening for breast cancer, it has also been important to identify 
risk groups for lung cancer. 

Although much is known about predisposing factors, natural
history, and the outcome based on histology and stage, our
understanding remains very incomplete in many areas. What are
the early premalignant changes molecularly, biochemically, and
morphologically? Which changes are reversible and which are
not? What research tools are available to provide answers to
these questions? The identification of preinvasive lesions allows
for developing promising methods for early intervention (11).
The therapeutic paradigm and focus are today shifting from
targeting only clinically verified lung cancer, as previously,
toward targeting the premalignant and early-malignant lesions.
Furthermore, the prospect of lung cancer screening has today
become more meaningful as a consequence of recent developments
in biology and radiology and better possibilities to define
high-risk populations most suitable for lung cancer screening
(12).

The present article will focus on the clinical perspectives of 
our biological knowledge of premalignant and early-malignant 
lesions and the potential of the recent technological advance- 
ment for early diagnosis of lung cancer. 

Pathology of Preinvasive and Early Invasive 
Bronchial Lesions 

Most of the efforts to classify lung cancer have been
directed toward invasive carcinoma (13). However, better
understanding of the pathogenesis of lung cancer aroused renewed
interest in morphological abnormalities that fall short of
invasive carcinoma but may indicate initiation of carcinogenesis.
These morphological abnormalities are referred to as preinvasive
lesions and are shown in Fig. 1. The last edition of the
WHO classification of lung tumors included the classification of
preinvasive lesions as a separate section. Numerous recent studies
have indicated that lung cancer is not the result of a sudden
transforming event in the bronchial epithelium but a multistep
process in which gradually accruing sequential genetic and
cellular changes result in the formation of an invasive (i.e.,
malignant) tumor. Mucosal changes in the large airways that
may precede or accompany invasive squamous carcinoma include
hyperplasia, metaplasia, dysplasia, and CIS (14). Hyperplasia
of the bronchial epithelium and squamous metaplasia
have generally been considered reversible, and not premalignant
in the sense of squamous dysplasia and CIS (15).

Squamous metaplasia is a common finding, especially as a
response to cigarette smoking. Peters et al. (16) studied
bronchoscopic biopsies from six sites in 106 heavy cigarette
smokers. Squamous metaplasia was noted at one or more biopsy sites
in approximately two-thirds of the group, and one-fourth
showed squamous metaplasia in three or more biopsy sites. The
incidence of squamous metaplasia increased with smoking history
and was highest in individuals who had smoked more than
two packs of cigarettes a day. Auerbach et al. (17) noted similar
findings in autopsy tissues: basal cell hyperplasia and squamous
metaplasia are increased in smokers in proportion to smoking
history. Hyperplasia and metaplasia are believed to be reactive
changes in the bronchial epithelium, as opposed to true
preneoplastic changes (17, 18). The reasons for this include: (a) they
are frequently found in association with chronic inflammation,
and may be induced by mechanical trauma; (b) they spontaneously
regress after smoking cessation; (c) in chronic smokers,
the molecular changes present in these lesions are similar to
those present in histologically normal epithelium; and (d) there
are no reports linking their presence to increased risk for
developing lung cancer. In contrast, moderate-to-severe dysplasia and
CIS lesions seldom regress after smoking cessation (19).

Dysplasia and CIS are changes that frequently precede
squamous cell carcinoma of the lung. Saccomanno et al. (20)
studied more than 50,000 samples from 6,000 men, many of
whom had worked in the uranium mining industry. Both smoking
and uranium mining (radon exposure) were found to be
associated with increased incidence of dysplasia, CIS, and
invasive cancer. The studies of Saccomanno et al. established that
increasing degrees of sputum atypia may be recognized an
average of 4-5 years before the development of frank lung
carcinoma.

Another question is: which grades of sputum atypia progress
to cancer? From the Johns Hopkins cohort of the NCI
chest X-ray/sputum screening trial, we know that among
individuals with moderate atypia on sputum screening, ~10%
developed known cancer up to 9 years later. Among individuals
with severe atypia on the sputum screening, >40% developed
known cancer during the same time period (21). Although there
are data in the literature showing the relationship between
sputum atypia and subsequent invasive cancer, there is still very
little information about the histological progression in the
bronchial mucosa in the high-risk populations. In a recent
publication, nine patients with CIS were followed with
autofluorescence bronchoscopy at regular intervals, and 5 (56%) had
progression to invasive cancer despite endobronchial therapy
(22). The number of invasive cancers might even have been
higher if treatment had not been given. Ongoing studies of
high-risk subjects (e.g., the Colorado sputum cohort study)
including serial follow-up bronchoscopies will provide evidence
related to the frequency of development of invasive lung cancer
as it relates to smoking history, airflow obstruction, and sputum
atypia.

Since the previous WHO classification was published in
1981, two nonsquamous lesions have been added to the WHO
classification of premalignant lesions: atypical alveolar
hyperplasia and diffuse idiopathic neuroendocrine cell hyperplasia
(13). Both of these lesions are diagnosed rarely. The former
consists of lesions <5 mm in diameter and composed of a
peripheral epithelial cell proliferation with minimal cytological
atypia or stromal response and resembles bronchioloalveolar
carcinoma. The lesion has been seen in lung specimens resected
for lung cancer, but no prospective significance of this lesion
has been reported. However, this morphological lesion may play
a role in the pathogenesis of peripheral lung adenocarcinomas
(23, 24). The resolution of spiral CT (currently about 3 mm)
approaches the diameter of these lesions, and it is anticipated
that atypical alveolar hyperplasia will be increasingly
encountered in subjects undergoing this procedure (25). Diffuse
idiopathic neuroendocrine cell hyperplasia consists of a patchy
increase in the number of well-differentiated neuroendocrine
cells in the bronchioles. This process may result in the formation
of small carcinoid tumors, and for this reason it is considered
"preinvasive." To date, small cell carcinomas have not been
associated with this lesion (13).

Fig. 1 A, squamous metaplasia. The cells are widely dispersed, with a
regular maturation from the basal region to the top. There is
keratinization, and the nuclei/cytoplasmic ratio is low. B, moderate
dysplasia with ASD. Hypercellularity of the epithelium with incomplete
maturation and micropapillary invasion of capillaries are seen. The
nuclei/cytoplasmic ratio is high. C, severe dysplasia. There is marked
pleomorphism of the cells with irregularity and prominent nucleoli.

Recently, the use of fluorescence bronchoscopy (see below)
has increased the recognition of dysplastic lesions in the
large airways, and a new morphological entity, ASD, was
identified (26). Dysplasia of bronchial epithelium in
"micropapillomatosis" and the possible link between angiogenesis and
preinvasive bronchial epithelial dysplasia were recognized as early as
1983 by Muller and Muller (27), who also described the
ultrastructure of these lesions. It has been suggested that this
angiogenesis, which is recognized as capillary loops projecting into
the dysplastic bronchial lining, is responsible for the reduced
fluorescence seen in dysplastic lesions by LIFE bronchoscopes
(Figs. 1 and 3; Ref. 26). Future prospective studies will show
whether this morphological entity is correlated with a progression
to lung cancer so as to be a target for the use of
antiangiogenic agents for chemoprevention.

In general, there are several questions/problems relating to 
premalignant lesions, which will be addressed in future studies: 

(a) The morphological criteria for premalignant and
early-malignant changes, both on sputum cytology and in bronchial
biopsies, have to be validated for intra- and interobserver
reproducibility.

(b) Uniform and reproducible morphological/cytological
criteria have to be published more extensively, and a training set
of slides should be available. By the use of Internet technology,
this could be more easily facilitated (28).

(c) The correlation of sputum atypia and histological
changes in the bronchi in the high-risk population is not well
defined.

(d) The natural course of preinvasive changes in the bronchi
from the high-risk subjects needs to be clarified through
longitudinal, prospective studies with reference to histological
changes in the bronchi. Ongoing longitudinal studies with
fluorescence bronchoscopy and multiple biopsies with histology
and other biomarkers will define the ability of these markers to
assess for risk.

(e) What is the pathology/biology of the small, often
peripherally located, tumors (... mm in diameter), which are more
often diagnosed with newer radiological techniques (e.g.,
low-dose spiral CT)?

(f) Optimization of the tissue procurement and processing
techniques is important. Distinction of reactive from neoplastic
processes is usually straightforward, but diagnostic difficulties
may arise in the case of (a) inadequate or poorly prepared
histological material to evaluate and (b) the presence of
cytological atypia in epithelium stimulated by inflammation, viral
infection, radiation, or chemotherapy.

(g) DNA array analyses of gene expression: will it be
useful? How to collect proper mRNA? Can mRNA extracted
from microdissected cells obtained at bronchoscopy be globally
amplified and still remain representative of RNA present in situ?

Biology of Lung Carcinogenesis and Potential Early 
Detection Markers 

Lung cancer is the end-stage of multiple-step carcinogenesis,
in most cases driven by genetic and epigenetic damage
caused by chronic exposure to tobacco carcinogens. The genetic 
instability in human cancers appears to exist at two levels: at the 
chromosomal level, including large scale losses and gains; and 
at the nucleotide level including single or several base changes 
(29). Lung cancers harbor many numerical chromosomal abnor- 
malities (aneuploidy) and structural cytogenetic abnormalities 
including deletions and nonreciprocal translocations (30). At 
least three classes of cellular genes are involved: proto-oncogenes,
tumor suppressor genes (TSGs), and DNA repair genes. Oncogenic activation often
occurs via point mutations, gene amplification, or chromosomal 
rearrangement, whereas TSGs are classically inactivated by the 
loss of one parental allele combined with a point or small 
mutation or aberrant methylation of a target TSG in the remaining
allele. Additionally, dysregulated gene expression (either
increased or decreased expression) can occur by other, as yet 
unknown, mechanisms (30). Present studies have not yet con- 
firmed a prominent role for abnormalities of DNA repair genes 
in lung cancer. 

Preneoplastic cells contain several molecular genetic
abnormalities identical to some of the abnormalities found in overt
lung cancer cells (Fig. 2). These include allele loss at several
loci (3p, 9p, 8p, and 17p), myc and ras up-regulation, cyclin D1
overexpression, p53 mutations and increased immunoreactivity,
bcl-2 overexpression, and DNA aneuploidy (31-35).
Allelotyping of precisely microdissected preneoplastic foci of cells
suggests that the earliest changes in the bronchial epithelium are
allele loss at chromosome regions 3p, then 9p, 8p, 17p, 5q, and
then ras mutations (36-39). The biological meaning of loss of
heterozygosity (LOH) is only vaguely understood. Recent evidence
suggests that LOH
may be a consequence of mitotic recombination, that there is
only infrequent physical loss of genetic loci, and that LOH
probably precedes chromosomal duplication (40). Allelic loss
would thus be significant primarily in the presence of mutation
in the retained allele, and gene dosage would not be expected to
exert a phenotypic effect in LOH. Some reports have indicated
that ras activation occurs at early carcinoma stages (34).
Histologically normal bronchial epithelium adjacent to cancers has
also been shown to have certain genetic losses. Atypical
adenomatous hyperplasia, the potential precursor lesion of
adenocarcinomas, often has K-ras mutations (41).

Fig. 2 Sequential changes during lung cancer pathogenesis. Although multiple genetic markers are abnormal in lung cancers, their appearance during
the lengthy preneoplastic process varies. The timing of the appearance of these changes has been investigated in bronchial preneoplasia, because
sequential sampling of the peripheral lung is technically difficult. Several alterations have been described in histologically normal bronchial epithelium
of smokers. Other changes have been detected in slightly abnormal epithelium (hyperplasia, metaplasia), which we do not consider to be true
premalignant lesions. These changes are regarded as early changes. Molecular changes detected frequently in dysplasia are regarded as intermediate
in timing, whereas those usually detected at the CIS or invasive stages are regarded as late changes. It should be stressed that although there is a usual
order, exceptions regarding the timing of onset may occur. Some changes are progressive, such as chromosome 3p deletions. Thus small discrete
changes are present early, progressively become more extensive during pathogenesis, and frequently involve all or almost all of the arm in CIS
samples. Although allelic loss at the TP53 locus may precede the onset of mutations, data on this sequence are scant. Dysregulation of the RNA
component of telomerase (with its appearance in nonbasal cells) is an early event, whereas up-regulation of the gene is a relatively late event.



Molecular changes have been found not only in the lungs
of patients with lung cancer, but also in the lungs of current and
former smokers without lung cancer (18, 42, 43). These
observations are consistent with the multistep model of carcinogenesis
and a "field cancerization" process, whereby the whole region
is repeatedly exposed to carcinogenic damage (tobacco smoke)
and is at risk for developing multiple, separate, clonally
unrelated foci of neoplasia. The widespread aneuploidy that occurs
throughout the respiratory tree of smokers supports this theory
(44). However, the presence of the same somatic p53 point
mutation at widely dispersed preneoplastic lesions in a smoker
without invasive lung cancer indicates that expansion of a single
progenitor clone may spread throughout the respiratory tree
(45). These molecular alterations might thus be important
targets for use in the early detection of lung cancer and for use
as surrogate biomarkers in the follow-up of chemoprevention
studies. Detection of these mutant cells should be possible with
the different molecular techniques in accessible specimens. The
prospects of diagnosing these biological abnormalities in
multiple types of clinical specimens are discussed below.

Specimens for Clinical Testing: Sputum 

Since the 1930s, cytological examination of sputum has
been used for the diagnosis of lung cancer (46). Cytological
examination of sputa, especially multiple samples, is helpful for
the detection of central tumors arising from the larger bronchi
(e.g., squamous cell and small cell carcinomas). Exfoliated
cells from peripheral tumors, such as adenocarcinomas, arising
from the smaller airways (small bronchi, bronchioles, and
alveoli), especially those less than 2 cm in diameter, can be detected
only occasionally in sputum samples. This has become of
greater importance because the changes in cigarette exposure
(filters and decreased nicotine content) have created an increase
in adenocarcinomas and a decrease in squamous carcinomas
(47-49). The sensitivity of sputum cytology for early lung
cancer is only in the 20%-30% range from screening studies,
but by adhering to proper specimen collection, processing,
and interpreting criteria, the yield can be substantially improved
(50, 51). The data on the reliability of sputum cytology are conflicting
(52-54). Browman et al. (52) reported interobserver agreement
of 68% for exact and 82% for within-one-category matches. Holliday et
al. (54) reported low agreement within observers (27-60%) and
across observers (13-50%). Allowing within-one-category matches
raised intraobserver agreement two- to threefold,
and the same was true for interobserver agreement. The
variation in intra- and interobserver agreement seems to depend on
experience among the cytotechnicians/cytopathologists and the
composition of categories studied. A higher degree of agreement
is obtained for higher grades of dysplasia (54). Risse et al. (55)
showed that the ability to detect premalignant conditions is
dependent on the number and type of cells present in the deeper
airways, suggesting a mode of improvement that is unrelated to
observer reliability. MacDougall et al. (56) concluded that
sputum cytology was too insensitive and insufficiently accurate to
be included in the routine work-up of any patient suspected of
having lung cancer. To improve the reliability of sputum
cytology examinations, a simplification of the diagnostic categories
from 6 (normal; squamous metaplasia; mild, moderate, and
severe atypia; and carcinoma) to 2-3 categories has been
proposed (54). Future clinicopathological studies will be
required to validate this concept.
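
The two agreement measures quoted above (exact and within-one-category) can be illustrated with a minimal Python sketch. The function name `agreement`, the `tolerance` parameter, and the example readings are hypothetical and are not taken from the cited studies; the only element drawn from the text is the six-category ordinal scale.

```python
# Ordinal sputum cytology categories, in the order listed in the text.
CATEGORIES = [
    "normal",
    "squamous metaplasia",
    "mild atypia",
    "moderate atypia",
    "severe atypia",
    "carcinoma",
]


def agreement(reader_a, reader_b, tolerance=0):
    """Fraction of slides on which two readings differ by at most
    `tolerance` categories (0 = exact, 1 = within one category)."""
    if len(reader_a) != len(reader_b) or not reader_a:
        raise ValueError("readings must be non-empty and of equal length")
    rank = {name: i for i, name in enumerate(CATEGORIES)}
    hits = sum(
        abs(rank[a] - rank[b]) <= tolerance
        for a, b in zip(reader_a, reader_b)
    )
    return hits / len(reader_a)


if __name__ == "__main__":
    # Hypothetical readings of four slides by two observers.
    a = ["normal", "mild atypia", "severe atypia", "carcinoma"]
    b = ["normal", "moderate atypia", "moderate atypia", "squamous metaplasia"]
    print("exact agreement:", agreement(a, b))
    print("within-one-category agreement:", agreement(a, b, tolerance=1))
```

Because near misses on an ordinal scale count as matches once a tolerance of one category is allowed, the within-one-category figure can never be lower than the exact figure, which is consistent with the direction of the differences reported above.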

To improve the sensitivity of sputum examination as a 
population-screening tool for the detection of early lung cancer, 
several approaches are currently under development. 

Immunostaining. Annual sputum specimens were obtained
from individuals screened at Johns Hopkins, and
the patients were monitored for 8 years (57). Because the
clinical outcome of these patients was known, archival sputum
specimens were screened for the presence of biomarkers that
could indicate the presence of lung tumors in an early,
preinvasive stage. In an attempt to distinguish the pattern of marker
expression, Tockman et al. (58) studied two monoclonal
antibodies. Positive staining predicted subsequent lung cancer
approximately 2 years before clinical recognition of the disease,
with a sensitivity of 91% and a specificity of 38% (58). One of
these antibodies (703D4) had a higher sensitivity and was later
identified as recognizing hnRNP A2/B1 (59). The role of
hnRNP A2/B1 overexpression for detecting preclinical lung
cancer has been studied in a large high-risk population including
6000 Chinese tin miners who were heavy smokers and who had
an extraordinary rate of lung cancer (60). The results from this
study indicated that detection of hnRNP A2/B1 overexpression
in sputum epithelium cells was 2- to 3-fold more sensitive for
detection of lung cancer than standard chest X-ray and sputum
cytology methods. The method was particularly effective in
identifying early disease (60). The sensitivity was 74% versus
21% for cytology and 42% for chest X-ray. However, the
biomarker had a lower specificity (70%) compared with
cytology (100%) and chest radiograph (90%). An ongoing clinical
trial is evaluating the performance of the A2/B1 protein as a
biomarker for the early detection of second primary lung cancer
(SPLC). The patients at risk
for SPLC have the highest incidence of lung cancer (2-5%)
among asymptomatic populations (61). In this trial, 13 SPLCs
were identified by A2/B1, and the sensitivity and specificity
were 77-82% and 65-81%, respectively. Among the cases
identified as positive by immunocytochemistry and image
cytometry, 67% developed SPLC within 1 year (62). Whereas the
previous immunocytochemistry studies on archival material from the
NCI-supported screening
studies were made on sputum cells cytologically classified as
moderately or gravely atypical metaplastic, the latter
studies have been done on cytologically "normal-appearing"
cells. More recently, Sueoka et al. (63) reported the confirmation
of the value of overexpression of hnRNP A2/B1 to detect
preclinical lung cancer in Japan. Efforts to improve the
sensitivity of hnRNP markers are ongoing (64).
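
The practical meaning of these sensitivity and specificity figures depends on how common the disease is in the screened population. The following Python sketch applies Bayes' rule to obtain positive and negative predictive values. The function name `predictive_values` is hypothetical, and the 2% prevalence in the example is an illustrative assumption, not a figure from the cited trials; only the 74% sensitivity and 70% specificity come from the text.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test, via Bayes' rule.

    All arguments are fractions between 0 and 1.
    """
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


if __name__ == "__main__":
    # Sensitivity and specificity as quoted for hnRNP A2/B1 in the
    # tin-miner study; the prevalence is an assumed value for illustration.
    ppv, npv = predictive_values(0.74, 0.70, prevalence=0.02)
    print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

Under these assumptions most positive results would be false positives, which illustrates why a marker with moderate specificity is better suited to populations with a high prior risk.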

PCR Techniques. PCR techniques have been used for 
the evaluation of molecular biomarkers for early lung cancer 
detection. In a pilot study with selected patients from the Johns 
Hopkins Lung Project (JHLP), 8 (53%) of 15 patients with 
adenocarcinoma or large cell carcinoma were detected by
mutations in sputum cells from 1 to 13 months before clinical
diagnosis (65). However, the method seemed to be less sensitive 
than the protein marker described above, and the identification 
of specific gene abnormalities is further limited by the need to 
know the specific mutation sequence with which to probe the
sputum specimens. Currently, this approach is not practical for 
screening undiagnosed individuals. Future advances in gene 
chip technology may permit testing for all possible mutations of 
common oncogenes and TSGs in clinical specimens of asymp- 
tomatic individuals (62). 

Microsatellite markers are small repeating DNA sequences
found in the noncoding regions of a gene. PCR amplification of
these repeat sequences provides a rapid method for assessment
of LOH and facilitates the mapping of suppressor genes (66,
67). Microsatellite alterations are extensions or deletions of these
repeated elements. Detection of microsatellite alterations in
histological or cytological specimens may facilitate the
detection of clonal preneoplastic or neoplastic cell populations.
Although the detection of microsatellite alterations does not
indicate the specific genetic change in the tumor, detection of clonal
cell populations might serve as a cancer screening marker (65).
Identical alterations have been found in lung cancers and
corresponding sputum samples demonstrating minimal atypia (68).

The p16 gene is located on the short arm of chromosome
9 (9p21) and is frequently mutated or inactivated in tumors and
cell lines derived from lung cancer (69, 70). Belinsky et al. (71)
measured hypermethylation of the CpG islands in the sputum of
lung cancer patients and demonstrated a high correlation with
early stages of non-small cell lung cancer, which indicated that
p16 CpG hypermethylation could be useful in the prediction of
future lung cancer. However, prospective studies are needed to
evaluate the role of p16 hypermethylation as a marker for early
lung cancer detection. Multiple other genes are inactivated by
hypermethylation in lung cancer (72), and the detection of
hypermethylation may be useful for risk assessment and early
diagnosis.

Computer-assisted Image Analysis. Computer-assisted
image analysis was initially used to detect malignancy-associated
changes (e.g., subvisual or nonobvious changes in the
distribution of DNA in the nuclei of histologically normal cells
in the vicinity of preinvasive or invasive cancer; 73). In a
retrospective analysis of sputum cytology slides,
malignancy-associated changes alone correctly identified 74% of the
subjects who later developed squamous cell carcinoma (74). The
technique has been improved, and recent data showed
sensitivities of 75% for stage 0/I lung cancer and 85% for
adenocarcinomas with a specificity of 90% (75). This quantitative
microscopy technique allows the examination of thousands of cells per
slide within a relatively short time. Similar techniques have been
approved in the United States for cervical cancer screening and
might, in the future, play a role in lung cancer screening.
However, no prospective clinical study has evaluated this
technology in a larger lung cancer screening setting.

High Throughput Technology. With future advances in
gene chip technology, it might become feasible to probe for
expression of multiple genes in sputum specimens of
asymptomatic individuals. However, this requires a large amount of
undegraded RNA from respiratory tract cells. With
high-throughput technology, a higher sensitivity might be achieved
by using multiple markers, but at the cost of a lower
specificity, which would be undesirable for a screening study.
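
The trade-off described above can be made concrete with a small Python sketch. It assumes a panel is called positive when any single marker is positive, and that markers are conditionally independent given disease status; both are simplifying assumptions made here for illustration, not claims from the studies reviewed. The function name `combine_any_positive` and the marker values are hypothetical.

```python
from math import prod


def combine_any_positive(markers):
    """Combined (sensitivity, specificity) of a panel that is positive
    when at least one marker is positive.

    `markers` is a list of (sensitivity, specificity) pairs; markers are
    assumed conditionally independent given disease status.
    """
    # A case is missed only if every marker misses it.
    sensitivity = 1 - prod(1 - sens for sens, _ in markers)
    # A healthy subject is negative only if every marker is negative.
    specificity = prod(spec for _, spec in markers)
    return sensitivity, specificity


if __name__ == "__main__":
    # Three hypothetical markers, each 50% sensitive and 90% specific.
    panel = [(0.5, 0.9)] * 3
    sens, spec = combine_any_positive(panel)
    print(f"panel sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

In this sketch each added marker raises the panel sensitivity and lowers the panel specificity, which is the pattern the paragraph warns about; real markers are usually correlated, so the actual gain and loss would be smaller.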

In conclusion, we need to reevaluate the role of sputum
cytology for screening and early detection of lung cancer
because of advances in biomarkers and technology. Ongoing studies
with standard and biomarker analysis in high-risk groups
might change the previous negative attitude and provide a new
perspective on sputum cytology as a mass screening tool when
applied in a high-risk population. Adding different molecular
diagnostic tests gives the possibility for early diagnosis far in
advance of clinical presentation. However, validation of the
tests in larger prospective studies is necessary, and the
individual tests have to be compared with each other to define the role
of early diagnosis in the overall management of high-risk
subjects. Furthermore, health economic issues have to be
considered.

Specimens for Clinical Testing: BAL

BAL involves the infusion and reaspiration of a sterile
saline solution in distal segments of the lung via a fiberoptic
bronchoscope. Ahrendt et al. (76) examined a series of 50
resected non-SCLC tumor patients and compared the tumor and
BAL with regard to molecular markers including p53 mutations,
K-ras mutation, the methylation status of the CpG island of the
p16 gene, and microsatellite alteration (Tables 1 and 2). With
the possible exception of the test for microsatellite alteration, all
of the tests had relatively high sensitivity and could detect
mutant cells in the presence of a large excess of normal cells.
The frequencies of these changes in the tumors ranged from
27% (for K-ras mutations) to 56% (for p53 mutations). As
expected, p53 mutations were more frequent in central
(predominantly squamous cell) tumors, and K-ras mutations were more
frequent in peripheral (predominantly adenocarcinoma) tumors.
The specificity was high (nearly 100%) because, with the
exception of microsatellite alterations, the same genetic change in
BAL sample as in tumors was always found, but the sensitivity
was low, and in only 53% of tumors that contained molecular
lesions were the same abnormalities detected in corresponding
BAL fluids. Specifically, the tests were least helpful in the
group of patients in whom improved diagnostic abilities are
most needed, those with small, peripherally located tumors (77).
Unfortunately, the investigators were not able to compare the
molecular tests with routine cytopathological analysis of the
BAL specimens. The sensitivity of the molecular tests in BAL
specimens has to be improved, and we need to know the results
from subjects at increased risk (current and former smokers
without lung cancer and survivors of previous cancer of the
upper respiratory tract) and subjects with chronic lung diseases
as well as results from healthy never smokers.

A European group has previously shown that genetic
alterations detected in DNA from bronchial lavage of individuals
with lung cancer were also found in individuals with no
evidence of malignant disease (78), which raises the question about
the specificity of such molecular damage in neoplastic
conditions. To improve the sensitivity and specificity of detecting
allelic imbalance in lung tumors, high throughput PCR-based
microsatellite assays have been established (79). In a recent
study by Fielding et al. (80), the up-regulation of hnRNP A2/B1
was found to be a promising marker in BAL for the detection of
premalignant and malignant bronchial lesions with a diagnostic
sensitivity of 96% and a specificity of 82%.

It is too early yet to make conclusions as to whether BAL 
examinations will add to other pathological/molecular biological
clinical studies. To obtain diagnostic material for BAL,
bronchoscopy is required, and we do not have any data that
compare BAL examinations with biopsies. Thus, we do not 
know whether BAL is a valuable adjunct to the biopsies taken 
under the same bronchoscopy procedure. 

Specimens for Clinical Testing: Peripheral Blood 

For many years scientists have searched for a lung cancer-specific
tumor marker that could be detected in peripheral blood.
Optimism was raised in the "early" immunocytochemistry era
by the use of monoclonal antibodies raised against more-or-less
specific epithelial epitopes. In the search for epithelial cells in
peripheral blood and bone marrow, monoclonal antibodies
against cytokeratin have been used. However, these reactions
are clearly not cancer-specific, and some antibodies have been
shown to cross-react with normal blood or bone marrow elements
(81, 82). Another explanation could be that cells from the
macrophage/monocyte system may contain proteins derived
from a primary tumor that has undergone necrosis and
apoptosis and that these processed proteins are recognized by
the antibodies (82). On the basis of "traditional" immunocytochemistry,
no markers have been able to detect premalignant or
early-malignant disorders based on a peripheral blood sample.
However, with the development of DNA technologies, new
possibilities have been raised, and, with the use of PCR techniques,
some promising reports have been published.

Nanogram quantities of DNA circulating in blood are present
in healthy individuals (83, 84). Tumor DNA is also released
into the plasma component in increased quantities (85, 86).
Thus, the plasma and serum of cancer patients is enriched in
DNA, on average four times the amount of free DNA as compared
with normal controls (87). In a study by Chen et al. (88),
a comparison of microsatellite alterations in tumor and plasma
DNA was done in SCLC patients, and 93% of the patients with
Table 1 Tissues and other resources for the study of molecular markers



Specimen 



Ref. 



Comments 



Tumor tissue 
Sputum 

Surrogate organ 

Bronchial brush/wash 
Bronchial tissues 



BAL fluids 



Blood components 



Tissue for molecular staging 



Tumor cell lines 



Numerous

65, 68, 71, 74 
140 

141, 142, 143 

42, 43, 45, 144, 145 



72, 92, 149 



150, 151 



152, 153 



Mixture of cell types, may require microdissection (139). Extensively used
for most studies. Alcohol-fixed fine-needle aspirates may be used for
mutational and other studies.

Respiratory cells usually in small minority. Most samples fixed in 
Saccomanno's fixative. Several studies have been performed on these 
specimens many years later. 

Predominantly squamous epithelial cells. Buccal smears, brushings of
tongue or tonsil may be explored as potential surrogate organs resulting 
from the field effect of tobacco damage of the entire upper 
aerodigestive tract. This concept needs to be confirmed. 

Predominantly respiratory cells. Fresh, frozen, or alcohol-fixed samples are 
suitable for multiple studies including FISH.

Usually from bronchial biopsies, but may be obtained from surgical 
resection specimens. Formalin fixation and paraffin embedding required 
for histological diagnosis, although EASI preps may permit
identification and isolation of subpopulations. Paraffin sections may be
used for genotyping polymorphisms, for allelotyping, and for in situ
hybridization. 

BAL fluids are useful for examining the peripheral airway cells, which are 
the precursor cells of most adenocarcinomas. Numerous mononuclear 
cells present. Enrichment of epithelial cells desirable. 

Analysis of circulating tumor cells and genetic material shed by dying 
tumor cells into the plasma component may yield useful biological and 
diagnostic information. Gene mutations and presence of epithelial cell 
markers have been used to detect circulating tumor cells. Gene
mutations, allelic loss, microsatellite alterations, and aberrant
methylation have been used to identify tumor cell DNA released into
the fluid compartment. 

Although little data exist for lung cancers, regional lymph nodes, sentinel
lymph nodes, and surgical resection margins have been used in other
tumor types for molecular staging.

Provide an unlimited self-replicating source of high-quality molecular 
reagents and have been used for numerous studies. Cell lines may or 
may not reflect the properties of the tumors from which they were 
derived (26), although they probably represent cellular subpopulations 
(27). Aggressive metastatic tumors are more likely to be successfully
cultured (28) resulting in skewed data. 



Cultures of nonmalignant tissues 154, 155



Epithelial cultures may be useful for studying molecular changes during 
multistage pathogenesis. Temporary as well as a few immortalized 
cultures from nonmalignant epithelial cells have been established. 
B-lymphoblastoid cultures are useful for linkage analysis, for genetic
susceptibility studies, and for allelotyping corresponding tumors.

Tissues such as buccal smears, tumor-free lymph nodes, and peripheral
blood mononuclear cells are useful as controls for linkage analysis, for
genetic susceptibility studies, and for allelotyping corresponding tumors.

FISH, fluorescence in situ hybridization; EASI, epithelial aggregate separation and isolation.



Nonmalignant tissue from patients 156, 157, 158
and from cancer-free relatives 



microsatellite alterations in tumor DNA also had modifications
in the plasma DNA. However, some patients had LOH only in
the tumor DNA. Because most of the microsatellite alterations
were similar in tumor DNA and plasma DNA, they concluded
that some of the DNA circulating in the blood comes from the
tumor. Thus, modifications of circulating DNA can be used as
an early detection marker. Detection of aberrant DNA methylation
in serum DNA in patients with non-SCLC has been
reported (72). Although the number of patients was small and
the hypermethylated DNA was found in all stages, this opens up
the possibility of its use as an early lung cancer detection
marker. Furthermore, p53 and ras gene mutations have been
detected in the plasma and serum of patients with colorectal
cancers (89-91), pancreatic carcinomas (92, 93), and hematological
malignancies (94).

In conclusion, the limited direct accessibility of lung carcinomas
has led to efforts to identify tumor-associated soluble
markers in serum or plasma. Many of the currently recognized
soluble markers were first identified as "tumor" markers but,
when evaluated in nonneoplastic tissue, have often been found
in normal cells as well as in tumors. For early detection of lung
cancer, we need more clinical data evaluating these new molecular
biological markers from multiple sites, especially in high-risk
groups.
Table 2 Molecular approaches for lung cancer investigation

Approach: Gene mutations
Ref.: 159, 160, 161
Comment: Widely used technique, especially for p53 and ras genes. Often used
to determine the role of a newly discovered gene in the
pathogenesis of lung cancer. May be of diagnostic and prognostic
significance. Multiple methodologies available.

Approach: Allelotyping
Ref.: 18, 158
Comment: Useful as a partial substitute for mutational analysis and for
determining the chromosomal locations of putative tumor
suppressor genes. Widely used to study multistage pathogenesis.
Readily performed on formalin-fixed and microdissected tissues.
Increasing use of genotyping using automatic sequencers.

Approach: Gene expression at RNA and protein level
Ref.: 145, 162, 163, 164, 165, 166
Comment: Northern blotting and reverse transcription-PCR are widely used to
investigate gene expression. Western blotting often used for
detection of protein expression. In situ hybridization for message
expression can be performed on paraffin-embedded tissues and,
thus, can be used to investigate multistage pathogenesis.
Microarray techniques offer promise of examining all or most of
the genome but currently require relatively large amounts of
high-quality RNA from purified cell populations. SAGE technique useful
for investigation and identification of expressed genes. Similarly,
advances in proteomics will permit simultaneous detection of
multiple proteins. Numerous immunohistochemical studies of
oncogene expression have been used to study multistage
pathogenesis. Of particular interest, hnRNP expression on
exfoliated epithelial cells in sputum samples may predict for
development of cancer.

Approach: Molecular cytogenetics
Ref.: 40, 167, 168, 169, 170
Comment: In situ hybridization studies of fixed materials or using smears have
provided considerable information about numerical and structural
changes.

Approach: Comparative genomic hybridization
Ref.: 171, 172
Comment: Useful for detection of gene amplifications. Less sensitive for the
detection of regions of allelic loss.

Approach: Morphometric studies
Ref.: 74, 173, 174
Comment: May be applied to paraffin-embedded tissues. Useful for determining
aneuploidy and for measuring a number of nuclear and
cytoplasmic parameters.



Specimens for Clinical Testing: Bronchoscopy 

WLB is the most commonly used diagnostic tool for obtaining
a definite histological diagnosis of lung cancer. Bronchoscopy
has major diagnostic limitations for premalignant lesions.
Because these lesions are only a few cells thick (0.2-1
mm) and have a surface diameter of only a few millimeters, they
are rarely observed as visual abnormalities. Woolner (95) reported
that squamous cell CIS was visible to experienced bronchoscopists
in only 29% of cases. To address this limitation,
fluorescence bronchoscopy was developed. Early studies of
fluorescence bronchoscopy entailed the use of fluorescent drugs
(hematoporphyrin dyes) that were preferentially retained in malignant
tissue (96). Although studies evaluating this approach
did, in fact, show that early invasive and in situ cancers could be
localized, the detection of dysplasia remained problematic (97-100).
Furthermore, the development of photodynamic diagnostic
systems was hampered by problems including skin photosensitization
and interference from tissue autofluorescence. To
overcome these problems, a new laser photodynamic diagnostic
system was developed (101). This system detected tumor-specific
drug fluorescence at 630 nm wavelength, which is far
from normal tissue autofluorescence (500-580 nm), and interference
by autofluorescence from normal tissue should then
have been eliminated, but it remained a significant problem
(102).



Another approach was developed by Palcic et al. (103),
who noticed the lack of autofluorescence in the tumor lesions by
using blue light (442 nm) rather than white light to illuminate
the bronchial surface. They amplified the difference in autofluorescence
between normal, premalignant, and tumor tissue for
clinical use (103, 104). Using a high-quality charge-coupled
device and special algorithm, the LIFE was developed, taking
advantage of the principle that dysplastic and malignant tissues
reduce autofluorescent signals compared with normal tissue
(Fig. 3).

14 Review: Advances in Early Detection of Lung Cancer

Fig. 3 A, normal WLB and normal LIFE bronchoscopy. B, WLB shows inflammatory changes in the bronchial mucosa but no suspicion of
malignancy (left). LIFE bronchoscopy shows diffuse reduced autofluorescence (visualized by diffuse red-brownish coloration; arrows). Biopsy
demonstrated diffuse severe dysplasia.

Several studies have been performed comparing the diagnostic
specificity and sensitivity of LIFE bronchoscopy versus
WLB in diagnosing preinvasive and early-invasive lesions
(105-109; Table 3). Most of the studies reported a higher
diagnostic sensitivity of LIFE bronchoscopy in the detection of
premalignant and early-malignant lesions at the cost of lower
specificity (i.e., more false-positive results). In most of these
studies, lesions with moderate dysplasia or worse were the target
of the study and rated as "positive." The prevalence of preinvasive
and early lung cancer varies widely from one study to
another, from 20.7% (105) to 65.8% (102). Beyond the risk
profile, the explanation might be genetic variations or different
levels of experience among the endoscopists as well as the
pathologists involved. Furthermore, there seems to be a training
effect in using the LIFE bronchoscope, which has been demonstrated
by Venmans et al. (107). In their study, the diagnostic
sensitivity increased from 67 to 80% when comparing the first
and the second half of the study. The use of the LIFE device in
conjunction with WLB improved the detection rate of preneoplastic
lesions and CIS significantly (Table 3). Kurie et al. (106)
looked for more subtle tissue transformation, but their study
included few patients with moderate dysplasia or worse. No
improvement in the evaluation of metaplasia index was observed
with the use of LIFE bronchoscopy. Thus, differences in
the study population might explain the different conclusion.
There are still no clinical studies with sufficient long-term data
showing that moderate dysplasia is the most relevant clinical
predictor of eventual malignancy. Conclusions from the existing
studies are also limited by the potential methodological
bias related to the order in which the different bronchoscopy
procedures are done and whether the same examiner
has performed both procedures. To address these issues, a
prospective randomized study between LIFE bronchoscopy and
WLB was done at the University of Colorado Cancer Center.
The study design included a randomization with regard to the
order of procedure as well as the order of the individual bronchoscopist
(109). The order of the procedure and of the individual
bronchoscopist did not affect the results. The study also
demonstrated a significantly higher sensitivity in detecting premalignant
lesions visualized by the LIFE, but at the cost of a
lower specificity (109). The reason for the low diagnostic specificity
found with the LIFE bronchoscopy in the different studies
might be attributable to the visualization of more abnormal foci
with the LIFE bronchoscope, with the consequence that a larger
number of biopsies were taken and, thus, there was a higher risk
of more false-positive results. The use of LIFE bronchoscopy
has led to the identification of a new morphological entity, the
ASD, which is described above. In a recent morphological study,
angiodysplastic changes were frequently found in preneoplastic
and early-malignant lesions in the bronchi (26). The morphological
entity has been confirmed in preneoplasias among smokers,
and the perspectives of this finding have been extensively
discussed (110). The prognostic significance of this morphological
entity is currently being studied in ongoing long-term follow-up
studies. Future studies have to evaluate the role of ASD as a
biomarker for early lesions and whether it can be used as a
marker for treatment effect or therapeutic target for chemoprevention.

Fig. 4 Seventy-one-year-old man with a spicular nodule in upper left
lobe demonstrated on low-dose helical CT (picture), but not visible on
chest X-radiography. CT-guided biopsy showed adenocarcinoma.

The LIFE bronchoscope may play an important role in the 
screening and follow-up of subjects at high risk of developing 
lung cancer. At this stage, however, it is unknown whether the 
LIFE bronchoscope will lead to a reduction in lung cancer 
mortality. There are also no data on cost-effectiveness or
cost-benefit analyses available for this new diagnostic procedure.
The use of the LIFE bronchoscope may in the future also
be extended to other indications, e.g., patients staged as having
resectable lung cancer on one side. Whether LIFE bronchoscopy
of the contralateral lung will disclose abnormalities that
would change the therapeutic decision has not yet been reported.

Recent Advances in Radiology

The previous NCI-sponsored screening trials failed to demonstrate
any reduction in lung cancer mortality by sputum
cytology and yearly chest radiography as mass screening tools
for lung cancer. Limitations of design and execution
of the studies, however, have been discussed extensively (8,
111, 112). An extended follow-up (median, 20.5 years) of the
Mayo Lung Project was recently published (113). There was
still no difference in lung cancer mortality between the intervention
arm and the control arm (4.4 versus 3.9 deaths per 1000
person-years). However, the median survival for patients with
resected early-stage disease was 16.0 years in the intervention
arm versus 5.0 years in the usual-care arm (P < 0.05). The latter
findings have raised the question as to whether some small
lesions with limited clinical relevance may have been identified
in the intervention arm, and the question of "overdiagnosis" was
discussed in accompanying editorials (114).



Mass screening for lung cancer has been performed in
Japan for many years and has been performed in over 500,000
people in about 80% of the local communities (115). Sobue et
al. (116) observed that annual clinic-based chest X-ray screening
for lung cancer in Japan showed reduced lung cancer mortality
by about one-fourth among individuals who underwent
screening once a year. In this screening program, the relative
odds ratio of dying from lung cancer within 12 months was
0.535 and in the 12-24-month period was 0.638 (117). However,
many studies have focused on the pitfalls in the detection
of abnormalities by radiography (118-122). The limit of chest
radiographic sensitivity for nodule detection is roughly 1 cm in
diameter, by which time the tumor has over 10^9 cells and may
already have violated bronchial epithelium and vascular epithelium.
CT has been shown to be more effective in the detection
of peripheral lung lesions compared with plain radiography or
conventional tomography of the whole lung (123, 124).

Spiral CT scan is a relatively new technology with the
ability to continuously acquire data resulting in a shorter scanning
time, a lower radiation exposure, and improved diagnostic
accuracy compared with those of plain radiography (125-127).
Spiral CT allows the whole chest to be imaged in one or two
breath-holds, reducing motion artifacts and eliminating respiratory
misregistration or missing nodules. Although there is
greater radiation exposure with CT than with chest radiography,
low-dose techniques (lower mA of 30-50 compared with 200
for conventional CT) have achieved calculated exposure doses
that are 17% that of conventional CT and 10 times that of chest
radiographs. Further reduction in radiation dose while maintaining
diagnostic accuracy is a topic of current research. Furthermore,
for the baseline screening low-dose spiral CT scan, i.v.
contrast is not administered. Nodules as small as 1-5 mm can be
shown with modern spiral CT technology (25, 128). The obvious
advantages with this new technology led some groups in
Japan and in the United States to look to low-dose spiral CT as
a tool for screening (Refs. 129-131; Tables 4 and 5).
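
The tube currents quoted above give a rough sense of where the reported dose reduction comes from. The sketch below assumes that dose scales linearly with tube current when all other acquisition settings (tube voltage, rotation time, pitch, collimation) are held fixed; that simplification and the function name are assumptions made for illustration, not the dose calculation used in the cited studies.

```python
def relative_dose(low_dose_ma, reference_ma=200.0):
    """Dose as a fraction of the reference scan, assuming dose is
    proportional to tube current and nothing else changes."""
    return low_dose_ma / reference_ma


# Tube currents quoted in the text: 30-50 mA versus 200 mA.
for ma in (30, 50):
    print(f"{ma} mA: {relative_dose(ma):.0%} of the 200 mA dose")
```

Under this assumption the low-dose range corresponds to 15-25% of the conventional dose, which brackets the reported figure of about 17%.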

In a Japanese report, spiral CT scans and chest radiographs
were done twice a year in 1369 individuals (129). Peripheral
lung cancer was detected in 15 (0.43%) of 3457 examinations,
and, among the 15 lung cancer cases detected, the results of
chest X-ray were negative in 11 (73%), and the tumors were
detected only by low-dose spiral CT. The detection rates of
low-dose spiral CT and chest X-ray were 0.43% (15 of 3457
examinations) and 0.12% (4 of 3457 examinations), respectively.
Furthermore, 14 (93%) of the 15 lung cancers were stage
I disease. The histology showed that 11 of the 15 lung cancer
cases were adenocarcinoma, and 1 had squamous cell carcinoma.
The effective exposure dose with spiral CT scan in that
study was calculated to be about one-sixth that of conventional CT.
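
The percentages in this report follow directly from the counts given. As a quick check, the sketch below recomputes them; the variable and function names are illustrative only.

```python
def percent(count, total):
    """Return count/total expressed as a percentage."""
    return 100.0 * count / total


EXAMINATIONS = 3457   # total examinations in the report (129)
CT_DETECTED = 15      # cancers detected by low-dose spiral CT
XRAY_DETECTED = 4     # of those, cancers also seen on chest X-ray
STAGE_I = 14          # cancers that were stage I

ct_rate = percent(CT_DETECTED, EXAMINATIONS)
xray_rate = percent(XRAY_DETECTED, EXAMINATIONS)
missed_by_xray = percent(CT_DETECTED - XRAY_DETECTED, CT_DETECTED)
stage_i_share = percent(STAGE_I, CT_DETECTED)

print(f"CT detection rate:     {ct_rate:.2f}%")
print(f"X-ray detection rate:  {xray_rate:.2f}%")
print(f"CT-only detections:    {missed_by_xray:.0f}%")
print(f"Stage I among cancers: {stage_i_share:.0f}%")
```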

The ELCAP in New York was designed to determine: (a)
the frequency with which nodules were detected; (b) the frequency
with which detected nodules represent malignant disease;
and (c) the frequency with which malignant nodules are
curable (131). In the ELCAP study, 27 lung cancers were found
among 1000 subjects screened. Among the 27 patients with
cancer, 85% had stage I disease (Table 5).

Another population-based study on low-dose CT screening
has been published by Sone et al. (130), using a mobile low-dose
spiral CT scanner. The detection rate was 0.48% (i.e., 4-5
Table 3 LIFE bronchoscopy versus WLB in diagnosing premalignant and early-malignant lesions

Sensitivity

Author                    No.    LIFE+WLB   LIFE    WLB     Relative sensitivity
Lam et al. (105)          700    0.67       NR      0.25    6.3 (2.7)(c)
Kurie et al.(b) (106)     234    NR         0.38    NR      NR
Venmans et al. (107)      139    NR         0.89    0.78    1.43
Vermylen et al. (108)     172    0.93       NR      0.25    3.75
Kennedy et al. (109)      394    0.79       0.72    0.18    4.4

Specificity

Author                    LIFE+WLB   LIFE    WLB     Relative specificity
Lam et al. (105)          0.66       NR      0.90    NR
Kurie et al.(b) (106)     NR         0.56    NR      NR
Venmans et al. (107)      NR         0.61    0.88    NR
Vermylen et al. (108)     0.21       NR      0.87    NR
Kennedy et al. (109)      0.3        0.43    0.78    0.38

Predictive values(a)

Author                    PPV LIFE+WLB   NPV LIFE+WLB   PPV LIFE   NPV LIFE   PPV WLB   NPV WLB
Lam et al. (105)          0.33           0.89           NR         NR         0.39      0.83
Kurie et al.(b) (106)     NR             NR             0.16       0.81       NR        NR
Venmans et al. (107)      0.20           NR             0.14       0.99       0.32      0.98
Vermylen et al. (108)     0.13           0.96           NR         NR         0.19      0.90
Kennedy et al. (109)      0.21           0.85           0.25       0.87       0.17      0.80

(a) PPV, positive predictive value; NPV, negative predictive value; NR, not reported.
(b) Based on reference pathologist.
(c) If invasive carcinoma is included.
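
The relative sensitivity column in Table 3 appears to be the sensitivity of LIFE combined with WLB divided by the sensitivity of WLB alone. The sketch below recomputes that ratio for the rows that report both values; the function name is illustrative, and because the inputs are the rounded table entries, the results can differ slightly from the printed figures.

```python
def relative_sensitivity(combined, wlb_alone):
    """Ratio of LIFE+WLB sensitivity to WLB-only sensitivity."""
    return combined / wlb_alone


# (LIFE+WLB sensitivity, WLB sensitivity) as read from Table 3.
reported = {
    "Lam et al. (105)": (0.67, 0.25),
    "Vermylen et al. (108)": (0.93, 0.25),
    "Kennedy et al. (109)": (0.79, 0.18),
}

for author, (combined, wlb) in reported.items():
    ratio = relative_sensitivity(combined, wlb)
    print(f"{author}: {ratio:.2f}")
```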



Table 4 Results from three population-based screening studies with low-dose spiral CT (LDCT)

                         No. of individuals   True positive   False positive(a)   Predictive   Detection rate (%)
Authors                  studied              (n)             (%)                 value (%)    LDCT        X-ray    Pack-yr   Age incl.
Kaneko et al. (129)      1369                 15              15.6                6.6          0.43        0.12     >20       >50
Sone et al. (130)        3967                 19              5.0                 8.8          0.46-0.5             >30(b)    40-74
Henschke et al. (131)    1000                 27              20.6                11.6         2.7         0.70     >10(c)    >60

(a) Defined as individuals with "test-positive," in whom further workup gave no suspicion of malignancy.
(b) The study also included a group of nonsmokers.
(c) Average = 45 (not reported in the other studies).



Table 5 Histology, stage, and size of primary lung cancer detected by low-dose spiral CT





No. of cancers/ 
No. screened 


Histology % 




TNM % 






Size (mm) 




Author 


Adeno" Squam. Other 


1 


11 II) 


IV 


Average 


Range OO 11-20 


>2I 


Kaneko et al. (129)
Sone et al. (130)
Henschke et al. (131)


15/1369(1.1%) 
19/5483(0.3%) 
27/1000(2.7%) 


73 17 

63 5 32 

67 3 30 


93 
84 

85 


7 

4 1) 


16 


12 
17 


8-1S 

6-47 4 14 
15 8 


• 3 
4



Adeno, adenocarcinoma; Squam., squamous cell carcinoma; TNM, tumor-node-metastasis.



cases per 1000 examinations). Surprisingly, there was no difference
in the detection rate among smokers (0.52%) versus
nonsmokers (0.46%). The results from the three population-based
studies are summarized in Tables 4 and 5. The conclusion
from these studies is that 85% of the lung cancers detected by
low-dose CT were in stage I, offering improved possibility for
curative treatment and better prognosis in general. However, the
issue of "false-positive" scans has to be taken into consideration.
Thus far, up to 20% of the participants with nodules on the scan
had no malignancy during the follow-up period. The possibility
that the cancers found represent incidental cancers as in the
Mayo Lung Project must also be considered (114). The results
from these studies confirm the expectation that low-dose CT
increases the detection of small noncalcified nodules and that
lung cancers at an earlier and more curable stage are detected.
The mobile CT screening study by Sone et al. (130) showed that
low-dose CT increased the likelihood of detection of malignant
disease 10 times as compared with radiography. The overall rate
of malignant disease was lower in the Japanese studies (129,
130) compared with the ELCAP study (Ref. 131; detection rates
0.43-0.48% versus 2.7%). This could be because the Japanese
studies screened individuals from the general population ages
40-74, whereas ELCAP screened people at high risk, ages ≥60,
with a tobacco history of at least 10 pack-years. Thus, as
expected, the risk of the population to be screened affects the
rate of cancer detection.

Questions remaining to be answered include: (a) what are
the diagnostic sensitivity and specificity of this procedure; and
(b) does screening reduce lung cancer mortality? The spiral CT
has not been as sensitive for small central cancers as it is for
small peripheral cancers (129, 131). Minute nodules of lung
cancer that are near the threshold of detectability may be overlooked
at spiral CT screening (132). A prospective study of the
diagnostic sensitivity of spiral CT has recently shown that the
diagnostic sensitivity exceeded the sensitivity of conventional
CT in previous reports (25). However, there were limitations in
the detection of intrapulmonary nodules smaller than 6 mm and
of pleural lesions. Compared with surgery (thoracotomy with
palpation of deflated lung, resection, and histology), the sensitivity
of spiral CT was 60% for intrapulmonary nodules of <6
mm and 95% for nodules of ≥6 mm and was 100% for neoplastic
lesions ≥6 mm. Furthermore, a marked difference in the
sensitivities of two independent observers was found for nodules
smaller than 6 mm, whereas agreement was much better for
6-10-mm nodules (25). Given these promising preliminary clinical
results, further research is needed to determine the optimal
technique for spiral CT screening, which includes collimation,
reconstruction interval, pitch, and viewing methods. Decreasing
the slice thickness to 3 mm, monitoring the viewing of examinations,
and computer-aided diagnosis have been used to improve
the diagnostic capability of spiral CT in the detection of
pulmonary nodules (133-136).

Future large-scale randomized studies have to confirm
whether spiral CT screening will in fact lead to a reduction in
lung cancer mortality. In a randomized study, the following
questions arise: (a) what is the optimal high-risk group to study
and what should be the control arm? (b) what should be the end
points (goals) of the studies? The ultimate goal is to reduce
lung cancer mortality. However, although this is a long-term
goal, intermediate end points from such studies should be evaluated.
The change to more curable stages at diagnosis for the
lung cancer patients is one such immediate goal; (c) what is the
optimal workup and the morbidity of this program? (d) what is
the cost of such a screening program? and (e) what is the
false-positive rate of the screening findings? Smoking cessation
programs should be incorporated in the future
design of screening studies because it has been shown that
screening with low-dose CT in participants who are still smoking
provides substantial motivation for smoking cessation (137).

The studies with spiral CT scan have demonstrated the
superior diagnostic ability in the detection of small peripherally
located tumors, most of the malignant ones of adenocarcinoma
type of histology. The diagnostic sensitivity of spiral CT for
more centrally located tumors (mostly squamous cell carcinoma)
is significantly lower than for the peripherally located
ones. Through these spiral CT studies, we will learn about the
biology, pathology, and clinical course of these small tumors,
which might be different from what we know about clinically
more evident tumors detected routinely in previous studies.

Because lung cancer is so common, the introduction of any
new screening technique in this area has to be underpinned by
careful definition of the cost implications and must be justified
by compelling evidence. The cost effectiveness of the spiral
CT approach should be assessed by evaluating the rate of
over-diagnosing nonmalignant, relatively common abnormalities
and comparing CT imaging to other diagnostic technologies.

PET with FDG has recently emerged as a practical and
useful imaging modality in the preoperative staging of patients
with lung cancer. However, whereas CT is most frequently used
to provide additional anatomical and morphological information
about lesions, FDG PET imaging provides physiological and
metabolic information that characterizes lesions that are indeterminate
by CT. FDG PET imaging takes advantage of the
increased accumulation of FDG in transformed cells and is
sensitive (~95%) for the detection of cancer in patients who
have indeterminate lesions on CT (138). The specificity (~85%)
of PET imaging is slightly less than its sensitivity because some
inflammatory processes avidly accumulate FDG. The high negative
predictive value of PET suggests that lesions considered
negative on the study are benign, biopsy is not needed, and
radiographic follow-up is recommended. Several studies have
documented the increased accuracy of PET compared with CT
in the evaluation of the hilar and mediastinal lymph node status
in patients with lung cancer (138). However, the PET resolution
is sufficient only for nodules ≥6 cm and will not be helpful in
detecting the very small nodules. Compared with low-dose
spiral CT, the FDG PET scan is more expensive and time
consuming. The role of PET scan in early diagnosis of lung
cancer in an asymptomatic high-risk population has not yet been
evaluated. However, future studies have to include PET evaluation
to define its role in a population screening setting.

Conclusion 

Recent advances in molecular biology and pathology have
led to a better understanding and documentation of morphological
changes in the bronchial epithelium before development of
clinically evident lung carcinomas. Combined with technical developments
in radiological and bronchoscopic techniques, these
procedures offer great promise in diagnosing lung cancer far in
advance of clinical presentation. Any of these individual procedures
could be incorporated into the routine management of
individuals at risk for developing primary or secondary lung
cancer, and for several of these methods, clinical studies are
under way. Preliminary reported data are very promising for the
early detection of lung cancer. Future studies must incorporate
the different methods in a multidisciplinary scientific setting to
evaluate the role of the individual method in the overall management
of individuals at high risk for developing lung cancer.
Several of these tests might diagnose the disease at the stage of
clonal expansion before invasive carcinoma has developed. A
management and intervention strategy appropriate to that stage
of disease has to be developed. Preliminary studies of chemoprevention
agents are reported, and new agents based on other
biological mechanisms are under development and ready for
clinical trials. It is now time to plan clinical trials that evaluate
both diagnostic and therapeutic approaches to assess their impact
on the incidence of clinical lung cancer.